Write out the formulas for "token at pos 2" and "token at pos 3" and you will get the idea.
kv / k is the memory mechanism. The token with high k can be remembered for a long duration, if W is close to 1 in the channel.
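
To make this concrete, here is a minimal single-channel sketch (my own illustration, not the training code; it assumes the accumulators kv = Σ W^(t-i)·exp(k_i)·v_i and k = Σ W^(t-i)·exp(k_i), with W the per-channel decay factor):

```
import numpy as np

def wkv_recurrent(W, k, v):
    """Sequential kv / k memory for a single channel.
    W: decay factor in (0, 1); k, v: per-token key / value scalars."""
    kv_acc, k_acc, out = 0.0, 0.0, []
    for t in range(len(k)):
        kv_acc = W * kv_acc + np.exp(k[t]) * v[t]   # decayed value accumulator
        k_acc  = W * k_acc  + np.exp(k[t])          # decayed normalizer
        out.append(kv_acc / k_acc)
    return np.array(out)

k = np.array([4.0] + [0.0] * 7)            # token 0 has a high k ...
v = np.array([1.0] + [0.0] * 7)
print(wkv_recurrent(W=0.99, k=k, v=v))     # ... and is still clearly visible at the last position
print(wkv_recurrent(W=0.50, k=k, v=v))     # ... but has largely decayed away when W is small
```
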
**RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
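
A toy illustration of why this matters (my own sketch, not the actual CUDA kernel): because W is known before seeing any data, the decay weights W^(t-i) can be precomputed as a T×T matrix, and the whole sequence collapses into matrix products that run in parallel over time:

```
import numpy as np

T = 8
rng = np.random.default_rng(0)
k, v = rng.normal(size=T), rng.normal(size=T)
W = 0.9                                     # per-channel decay, fixed before seeing the data

# Sequential (RNN-style) scan.
kv_acc, k_acc, seq = 0.0, 0.0, []
for t in range(T):
    kv_acc = W * kv_acc + np.exp(k[t]) * v[t]
    k_acc  = W * k_acc  + np.exp(k[t])
    seq.append(kv_acc / k_acc)

# Parallel form: decay[t, i] = W**(t-i) for i <= t, precomputable because W is data-independent.
t_idx, i_idx = np.arange(T)[:, None], np.arange(T)[None, :]
decay = np.where(i_idx <= t_idx, W ** (t_idx - i_idx), 0.0)
par = (decay @ (np.exp(k) * v)) / (decay @ np.exp(k))

print(np.allclose(seq, par))                # True: both forms give the same outputs
```
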
### RWKV v2+ improvements (not yet uploaded to GitHub; used in the latest 1.5B run)
```
x = x + self.att(self.ln1(x))   # time-mix ("attention") sub-block, pre-LN residual
x = x + self.ffn(self.ln2(x))   # channel-mix (FFN) sub-block, pre-LN residual
```
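
For context, a hypothetical wrapper showing where these two lines typically sit; `att` and `ffn` below are just placeholders for the actual time-mix and channel-mix modules:

```
import torch.nn as nn

class Block(nn.Module):
    """Sketch of a pre-LN residual block; att / ffn are stand-ins for the real sub-modules."""
    def __init__(self, n_embd, att, ffn):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)
        self.att, self.ffn = att, ffn

    def forward(self, x):
        x = x + self.att(self.ln1(x))   # LayerNorm applied *before* each sub-module (pre-LN)
        x = x + self.ffn(self.ln2(x))
        return x
```
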
I need a better CUDA kernel to (1) pull off maxK so there's no need to clamp k to 60, (2) fix divide-by-zero without using K_EPS, and (3) support bf16/fp16. **Please let me know if you are a CUDA expert :)**
Removing the maxK limitation will also make it easy to clear the state of a kv / k channel by feeding it a huge K, which then dominates both accumulators.
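
For reference, the usual trick for (1) and (2) is to carry a running maximum of the exponent and rescale the accumulators whenever it grows, so exp() never sees a large argument and the denominator needs no K_EPS padding. A rough Python sketch of the idea (not a CUDA kernel; the function name and test values are mine):

```
import numpy as np

def wkv_stable(W, k, v):
    """Numerically stable scan: track a running exponent m so exp() never overflows."""
    a, b, m = 0.0, 0.0, -np.inf               # true accumulators are (a, b) * exp(m)
    out = []
    for t in range(len(k)):
        m_new = max(m, k[t])                  # running maximum of the exponent
        scale_old = np.exp(m - m_new)         # <= 1, rescales the previous accumulators
        scale_new = np.exp(k[t] - m_new)      # <= 1, scales the incoming token
        a = W * a * scale_old + scale_new * v[t]
        b = W * b * scale_old + scale_new
        m = m_new
        out.append(a / b)                     # the exp(m) factors cancel in the ratio
    return np.array(out)

k = np.array([1.0, 2.0, 500.0, 0.0])          # pos 2 has a huge k that would overflow naive exp()
v = np.array([1.0, 2.0, 3.0, 4.0])
print(wkv_stable(0.9, k, v))                  # [1.0, ~1.75, 3.0, 3.0]: finite, no clamp, no K_EPS
```

The huge k at position 2 snaps the output to that token's v and effectively erases the earlier state, which is the state-clearing effect mentioned above.
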