Update README.md

main
PENG Bo authored 3 years ago, committed by GitHub
parent 5bd56f1f2d
commit a4b0759bf6

@@ -43,6 +43,8 @@ You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai
How it works: RWKV gathers information into a number of channels, which decay at different speeds as you move to the next token. It's very simple once you understand it.
**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
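A minimal NumPy sketch of why this matters (variable names here are illustrative, not the repo's actual code): with a fixed per-channel decay w, the weight on token i at step t is simply w^(t-i), so the whole sequence collapses into one masked weighted sum that matches the sequential recurrence exactly.

```python
import numpy as np

# Sketch: a data-independent per-channel decay makes the recurrence parallelizable.
T, C = 6, 4                                # sequence length, channels (illustrative)
rng = np.random.default_rng(0)
k = rng.normal(size=(T, C))                # per-token info strength
v = rng.normal(size=(T, C))                # per-token values
w = np.array([0.8, 0.5, 0.9, 0.99])        # fixed, trainable decay per channel

# Sequential (RNN-style): state_t = w * state_{t-1} + k_t * v_t
state, seq = np.zeros(C), []
for t in range(T):
    state = w * state + k[t] * v[t]
    seq.append(state.copy())
seq = np.stack(seq)

# Parallel: since w never depends on the data, the weight on token i at
# step t is just w ** (t - i), so every position is an independent sum.
t_idx = np.arange(T)
decay = w[None, None, :] ** (t_idx[:, None, None] - t_idx[None, :, None])
mask = np.tril(np.ones((T, T)))[:, :, None]    # only tokens i <= t contribute
par = (decay * mask * (k * v)[None, :, :]).sum(axis=1)

assert np.allclose(seq, par)
```

In a gated RNN the decay at each step depends on the input, so no such closed form exists and the loop stays sequential.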
### Inference
Check https://github.com/BlinkDL/RWKV-v2-RNN-Pile for L24-D1024 (24 layers, dim 1024) and L12-D768 (12 layers, dim 768) models trained on the Pile, along with the latest code. Inference is very fast on CPU (the default mode).
@@ -91,8 +93,6 @@ Write out the formulas for "token at pos 2" and "token at pos 3" and you will get the idea
kv / k is the memory mechanism: a token with a high k can be remembered for a long time, as long as W is close to 1 in the channel.
**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
The R-gate is important for performance. k = the info strength of this token (to be passed on to future tokens). r = whether to apply the info at this token.
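As a rough single-channel sketch of both mechanisms (hypothetical names and a simplified recurrence, not the actual model code, which adds token-shift, projections, and more): the e^k terms act as weights in a decayed weighted average, and sigmoid(r) gates the readout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Single-channel sketch of kv / k plus the R-gate (illustrative, simplified).
# W: per-step decay; k: how strongly each token writes into memory;
# r: how much of the memory readout this token actually uses.
def rwkv_channel(W, k, v, r):
    kv_acc, k_acc, out = 0.0, 0.0, []
    for k_t, v_t, r_t in zip(k, v, r):
        kv_acc = W * kv_acc + np.exp(k_t) * v_t   # weighted sum of values (kv)
        k_acc  = W * k_acc  + np.exp(k_t)         # weighted sum of weights (k)
        out.append(sigmoid(r_t) * kv_acc / k_acc) # gated memory readout
    return out

k = [3.0] + [-2.0] * 7      # the first token has a high k...
v = [1.0] + [0.0] * 7       # ...and carries the only nonzero value
r = [0.0] * 8

print(rwkv_channel(0.99, k, v, r))  # W near 1: the first token's value persists
print(rwkv_channel(0.50, k, v, r))  # W small: it fades quickly
```

With r_t pushed negative, sigmoid(r_t) ≈ 0 and the token simply ignores the memory; W = 0.99 keeps the high-k token alive far longer than W = 0.5.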
## RWKV-3 improvements (used in the latest 1.5B run)
