Update README.md

main
PENG Bo authored 3 years ago, committed by GitHub
parent 5bd56f1f2d
commit a4b0759bf6

@@ -43,6 +43,8 @@ You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai
How it works: RWKV gathers information into a number of channels, which decay at different speeds as you move to the next token. It's very simple once you understand it.
**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
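A minimal NumPy sketch of why this matters (variable names here are illustrative, not the repo's actual code): with a fixed per-channel decay w, the weight on token i at step t is simply w^(t-i), so the whole sequence collapses into one masked weighted sum that matches the sequential recurrence exactly.

```python
import numpy as np

# Sketch: a data-independent per-channel decay makes the recurrence parallelizable.
T, C = 6, 4                                # sequence length, channels (illustrative)
rng = np.random.default_rng(0)
k = rng.normal(size=(T, C))                # per-token info strength
v = rng.normal(size=(T, C))                # per-token values
w = np.array([0.8, 0.5, 0.9, 0.99])        # fixed, trainable decay per channel

# Sequential (RNN-style): state_t = w * state_{t-1} + k_t * v_t
state, seq = np.zeros(C), []
for t in range(T):
    state = w * state + k[t] * v[t]
    seq.append(state.copy())
seq = np.stack(seq)

# Parallel: since w never depends on the data, the weight on token i at
# step t is just w ** (t - i), so every position is an independent sum.
t_idx = np.arange(T)
decay = w[None, None, :] ** (t_idx[:, None, None] - t_idx[None, :, None])
mask = np.tril(np.ones((T, T)))[:, :, None]    # only tokens i <= t contribute
par = (decay * mask * (k * v)[None, :, :]).sum(axis=1)

assert np.allclose(seq, par)
```

In a gated RNN the decay at each step depends on the input, so no such closed form exists and the loop stays sequential.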
### Inference
Check https://github.com/BlinkDL/RWKV-v2-RNN-Pile for L24-D1024 (24 layers, dim 1024) and L12-D768 (12 layers, dim 768) models trained on the Pile, along with the latest code. Inference is very fast on CPU (the default mode).
@@ -91,8 +93,6 @@ Write out the formulas for "token at pos 2" and "token at pos 3" and you will get the idea
kv / k is the memory mechanism: a token with a high k can be remembered for a long time, as long as W is close to 1 in the channel.
**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
The R-gate is important for performance. k = the info strength of this token (to be passed on to future tokens). r = whether to apply the info at this token.
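As a rough single-channel sketch of both mechanisms (hypothetical names and a simplified recurrence, not the actual model code, which adds token-shift, projections, and more): the e^k terms act as weights in a decayed weighted average, and sigmoid(r) gates the readout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Single-channel sketch of kv / k plus the R-gate (illustrative, simplified).
# W: per-step decay; k: how strongly each token writes into memory;
# r: how much of the memory readout this token actually uses.
def rwkv_channel(W, k, v, r):
    kv_acc, k_acc, out = 0.0, 0.0, []
    for k_t, v_t, r_t in zip(k, v, r):
        kv_acc = W * kv_acc + np.exp(k_t) * v_t   # weighted sum of values (kv)
        k_acc  = W * k_acc  + np.exp(k_t)         # weighted sum of weights (k)
        out.append(sigmoid(r_t) * kv_acc / k_acc) # gated memory readout
    return out

k = [3.0] + [-2.0] * 7      # the first token has a high k...
v = [1.0] + [0.0] * 7       # ...and carries the only nonzero value
r = [0.0] * 8

print(rwkv_channel(0.99, k, v, r))  # W near 1: the first token's value persists
print(rwkv_channel(0.50, k, v, r))  # W small: it fades quickly
```

With r_t pushed negative, sigmoid(r_t) ≈ 0 and the token simply ignores the memory; W = 0.99 keeps the high-k token alive far longer than W = 0.5.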
## RWKV-3 improvements (used in the latest 1.5B run)
