diff --git a/README.md b/README.md
index 9788634..b3e75e8 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ Write out the formulas for "token at pos 2" and "token at pos 3" and you will ge
 
 kv / k is the memory mechanism. The token with high k can be remembered for a long duration, if W is close to 1 in the channel.
 
-It's also using my SmallInitEmb trick https://github.com/BlinkDL/SmallInitEmb (applicable to all transformers).
+It's also using my SmallInitEmb trick https://github.com/BlinkDL/SmallInitEmb (applicable to all transformers), and a custom CUDA kernel https://github.com/BlinkDL/RWKV-CUDA .
 
 The pseudocode (execution from top to bottom):
 
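A minimal sketch of the kv / k mechanism the hunk above describes, assuming the simplified per-channel recurrence out_t = (sum_i W^(t-i) k_i v_i) / (sum_i W^(t-i) k_i); the actual kernel at https://github.com/BlinkDL/RWKV-CUDA is more elaborate, and all names here (W, k, v, num, den) are illustrative rather than taken from the codebase:

```python
# Sketch only: per-channel state decays by W each step, so a token with a
# large k dominates the weighted average for many steps when W is near 1.
import numpy as np

def wkv_recurrence(k, v, W):
    """k, v: arrays of shape (T, C); W: per-channel decay in (0, 1], shape (C,)."""
    T, C = k.shape
    num = np.zeros(C)              # running sum of W^(t-i) * k_i * v_i  ("kv")
    den = np.zeros(C)              # running sum of W^(t-i) * k_i        ("k")
    out = np.empty((T, C))
    for t in range(T):
        num = W * num + k[t] * v[t]
        den = W * den + k[t]
        out[t] = num / (den + 1e-8)  # weighted average of past v, keyed by k
    return out
```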