From a4b0759bf654ddcc8b64ddf73808cd9e6bf0bff2 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Sat, 16 Jul 2022 06:44:50 +0800
Subject: [PATCH] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index ac6896d..ffca1de 100644
--- a/README.md
+++ b/README.md
@@ -43,6 +43,8 @@ You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai
 
 How it works: RWKV gathers information to a number of channels, which are also decaying with different speeds as you move to the next token. It's very simple once you understand it.
 
+**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
+
 ### Inference
 
 Check https://github.com/BlinkDL/RWKV-v2-RNN-Pile for L24-D1024 and L12-D768 models trained on the Pile (and the latest code). It's very fast on CPU (the default mode).
@@ -91,8 +93,6 @@ Write out the formulas for "token at pos 2" and "token at pos 3" and you will ge
 
 kv / k is the memory mechanism. The token with high k can be remembered for a long duration, if W is close to 1 in the channel.
 
-**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
-
 The R-gate is important for performance. k = info strength of this token (to be passed to future tokens). r = whether to apply the info to this token.
 
 ## RWKV-3 improvements (used in the latest 1.5B run)
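The paragraph this patch moves is the core of the parallelization argument, so here is a minimal numpy sketch (not code from the repo) of why a data-independent per-channel decay W lets the same result be computed either step by step or for all positions at once. It keeps only the decayed `kv / b` weighted-average part; the R-gate and the rest of the actual RWKV-2 formula are omitted, and the function names are illustrative only.

```python
import numpy as np

def rwkv_sequential(w, k, v):
    """Recurrent form: walk the sequence, decaying the running sums by w each step."""
    T, C = k.shape
    a = np.zeros(C)                       # running sum of exp(k_i) * v_i
    b = np.zeros(C)                       # running sum of exp(k_i)
    out = np.zeros((T, C))
    for t in range(T):
        a = w * a + np.exp(k[t]) * v[t]
        b = w * b + np.exp(k[t])
        out[t] = a / b                    # the "kv / k" memory mechanism
    return out

def rwkv_parallel(w, k, v):
    """Parallel form: because w never depends on the data, the decay weights
    w ** (t - i) are known up front, so every position is one batched weighted sum."""
    T, C = k.shape
    idx = np.arange(T)
    power = idx[:, None] - idx[None, :]                       # t - i, shape (T, T)
    mask = (power >= 0)[:, :, None]                           # causal mask
    decay = np.where(mask, w[None, None, :] ** power[:, :, None], 0.0)
    ek = np.exp(k)
    a = np.einsum('tic,ic->tc', decay, ek * v)
    b = np.einsum('tic,ic->tc', decay, ek)
    return a / b

rng = np.random.default_rng(0)
T, C = 8, 4
w = rng.uniform(0.5, 0.99, C)             # data-independent, trainable time-decay per channel
k = rng.normal(size=(T, C))
v = rng.normal(size=(T, C))
assert np.allclose(rwkv_sequential(w, k, v), rwkv_parallel(w, k, v))
```

In a usual RNN the decay would be a gate computed from the input, so `decay` could not be precomputed and the loop could not be unrolled this way; in RWKV the same expressive effect comes from having many channels with different fixed W values, as the moved paragraph explains.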