Write out the formulas for "token at pos 2" and "token at pos 3" and you will get the idea.
kv / k is the memory mechanism: a token with a high k can be remembered for a long duration, if W is close to 1 in that channel.
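A minimal sketch of that kv / k mechanism in plain Python (the function name, exponentiation of k, and the per-channel scalar W are illustrative assumptions; the real RWKV kernel adds further terms and numerical-stability tricks):

```python
import math

def kv_over_k(keys, values, W):
    """Decayed kv / k accumulation for one channel.
    num and den are both decayed by W each step, so when W is close to 1
    a token with a high k keeps dominating the output for many steps."""
    num = 0.0  # running sum of exp(k) * v, decayed by W
    den = 0.0  # running sum of exp(k), decayed by W
    outs = []
    for k, v in zip(keys, values):
        num = W * num + math.exp(k) * v
        den = W * den + math.exp(k)
        outs.append(num / den)
    return outs
```

With W = 0.99 the first token's value (high k) still dominates the output two steps later; with W = 0.1 it is quickly forgotten.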
It's also using my SmallInitEmb trick https://github.com/BlinkDL/SmallInitEmb (applicable to all transformers), and a custom CUDA kernel https://github.com/BlinkDL/RWKV-CUDA .
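The gist of SmallInitEmb is to initialize the embedding with tiny values and normalize it right away; a NumPy sketch (the exact init range and details are in the linked repo, the numbers below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 1000, 64

# tiny uniform init instead of the usual ~1/sqrt(d)-scale init (range assumed)
emb = rng.uniform(-1e-4, 1e-4, size=(vocab, d))

def layernorm(x, eps=1e-12):
    # tiny eps so the near-zero embeddings are still scaled up to unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# LayerNorm applied immediately after the embedding lookup
x = layernorm(emb[[3, 7]])
```

The idea is that the embedding starts out carrying almost no signal, and the LayerNorm rescales it, which tends to make early training more stable.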