1. Right now the time decay is like 0.999^T (0.999 is learnable). Change it to something like (0.999^T + 0.1), where 0.1 is learnable too. The 0.1 term never decays, so that part is kept forever.
3. Inject some trainable and interpolable positional encoding?
4. Aside from 2D rotation, we can try other Lie groups such as 3D rotation (SO(3)). Non-abelian RWKV lol.
5. RWKV might be great on analog devices (search for Analog Matrix-vector multiplication & Photonic Matrix-vector multiplication).
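
A minimal sketch of idea 1, assuming a single channel and plain Python floats instead of trainable tensors. The names `decay_weights`, `weighted_sum`, `decay` and `floor` are hypothetical and not taken from the RWKV code; in a real model both `decay` and `floor` would be learnable per-channel parameters.

```python
def decay_weights(T, decay=0.999, floor=0.1):
    """Weight for each distance t = 0..T-1: decay**t + floor.

    decay**t fades with distance, while floor (assumed name) is the
    constant part that is never forgotten.
    """
    return [decay ** t + floor for t in range(T)]


def weighted_sum(values, decay=0.999, floor=0.1):
    """Weighted sum over past values; values[-1] is the most recent (distance 0)."""
    n = len(values)
    w = decay_weights(n, decay, floor)
    # values[i] sits at distance n - 1 - i from the current position
    return sum(w[n - 1 - i] * v for i, v in enumerate(values))


if __name__ == "__main__":
    w = decay_weights(10000)
    # The weight at large distance approaches the floor instead of zero.
    print(w[0], w[-1])
```

With a pure `decay**t` weight the contribution of a distant token goes to zero; with the added constant it levels off at `floor`, so very old context keeps a small, fixed influence.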
### Misc
I have an idea to improve tokenization. We can hardcode some channels to have meanings. Example: