From 4c6db5607c6f94c38c10004efb292510bc71ba59 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Mon, 9 Aug 2021 14:31:02 +0800
Subject: [PATCH] Update README.md

---
 README.md | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index eaff3e9..37e6b35 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,41 @@
 # RWKV-LM
-The RWKV Language Model
+
+We propose the RWKV language model, with alternating time-mix and channel-mix layers:
+
+$$\text{Time-mix:}\quad \textbf{TM}_{t,c} = \text{sigmoid}(\textbf{R}_{t,c}) \cdot \sum_{u} \textbf{W}_{t,u,c} \cdot \text{softmax}_{t}(\textbf{K}_{u,c}) \cdot \textbf{V}_{u,c}$$
+
+$$\text{Channel-mix:}\quad \textbf{CM}_{t,c} = \text{sigmoid}(\textbf{R}_{t,c}) \cdot \sum_{d} \textbf{W}_{c,d} \cdot \text{gelu}(\textbf{K}_{t,d}) \cdot \textbf{V}_{t,d}$$
+
+* Here R, K, V are generated by linear transforms of the input, and W is parameterized.
+
+* The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103), with two differences.
+
+(1) We changed the softmax normalization. For masked language models, we define:
+
+$$\text{softmax}_{t}(\textbf{K}_{u,c}) = \frac{\exp(\textbf{K}_{u,c})}{\sum_{v \leq t}\exp(\textbf{K}_{v,c})}$$
+
+(2) We decompose W_{t,u,c} and introduce a multi-head W (here h is the head corresponding to channel c):
+
+$$\textbf{W}_{t,u,c} = f_{h}(t-u) \cdot \alpha_{h}(u) \cdot \beta_{h}(t)$$
+
+Moreover, we multiply the final output of the Time-mix layer by γ(t). The α, β, γ factors are needed because the effective context size is smaller when t is small, and they compensate for this.
+
+* The Channel-mix is similar to GeGLU (https://arxiv.org/abs/2002.05202) with an extra R factor.
+
+* Finally, we add extra time-mixing as in https://github.com/BlinkDL/minGPT-tuned.
+
+***
+
+Training loss, RWKV vs MHA+Rotary+GeGLU:
+
+![RWKV-vs-MHA](RWKV-vs-MHA.png)
+
+(character-level loss on the simplebooks-92 dataset: https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip)
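To make the Time-mix definition above concrete, here is a minimal PyTorch sketch (not the code in this repo; `TimeMix`, `n_embd`, `n_head`, `ctx_len`, and the parameter shapes are illustrative assumptions). It implements the masked softmax normalization from (1), the decomposed multi-head W from (2), and the γ(t) output factor:

```python
import torch
import torch.nn as nn

class TimeMix(nn.Module):
    """Sketch: TM_{t,c} = sigmoid(R_{t,c}) * sum_u W_{t,u,c} * softmax_t(K_{u,c}) * V_{u,c}."""

    def __init__(self, n_embd, n_head, ctx_len):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.key = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        self.receptance = nn.Linear(n_embd, n_embd)
        # decomposed W_{t,u,c} = f_h(t-u) * alpha_h(u) * beta_h(t), one set per head h
        self.f = nn.Parameter(torch.zeros(n_head, ctx_len))      # f_h(t-u)
        self.alpha = nn.Parameter(torch.ones(n_head, ctx_len))   # alpha_h(u)
        self.beta = nn.Parameter(torch.ones(n_head, ctx_len))    # beta_h(t)
        self.gamma = nn.Parameter(torch.ones(ctx_len, 1))        # gamma(t)

    def forward(self, x):                                  # x: (B, T, C)
        B, T, C = x.shape
        H, D = self.n_head, C // self.n_head
        ek = torch.exp(self.key(x)).view(B, T, H, D)       # exp(K_{u,c})
        v = self.value(x).view(B, T, H, D)
        r = self.receptance(x)

        # build W_{t,u,h}, zeroed where u > t (masked language model)
        idx = torch.arange(T, device=x.device)
        rel = (idx[:, None] - idx[None, :]).clamp(min=0)   # (T, T), entry = t - u
        w = self.f[:, rel] * self.alpha[:, None, :T] * self.beta[:, :T, None]
        w = w.masked_fill(idx[None, :] > idx[:, None], 0.0)

        # softmax_t folded into numerator / denominator:
        # num_t = sum_{u<=t} W_{t,u} exp(K_u) V_u,  den_t = sum_{v<=t} exp(K_v)
        num = torch.einsum('htu,buhd->bthd', w, ek * v)
        den = ek.cumsum(dim=1)
        out = (num / den).reshape(B, T, C)
        return torch.sigmoid(r) * out * self.gamma[:T]     # gate by R, scale by gamma(t)
```

Materializing W as an explicit (H, T, T) tensor keeps the sketch close to the equations, at the cost of O(T²) time and memory; exp(K) is also left unclamped here, so a real implementation would need to guard against overflow.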
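And a minimal sketch of the Channel-mix layer: GeGLU's gelu(K) · V core with the extra sigmoid(R) gate. The token-shift at the top is one possible form of the extra time-mixing; the half-channel split is an assumption modeled on the linked minGPT-tuned repo, and `hidden_mult` is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMix(nn.Module):
    """Sketch: CM_{t,c} = sigmoid(R_{t,c}) * sum_d W_{c,d} * gelu(K_{t,d}) * V_{t,d}."""

    def __init__(self, n_embd, hidden_mult=4):
        super().__init__()
        hidden = hidden_mult * n_embd
        self.key = nn.Linear(n_embd, hidden)
        self.value = nn.Linear(n_embd, hidden)
        self.receptance = nn.Linear(n_embd, n_embd)
        self.weight = nn.Linear(hidden, n_embd)            # the W_{c,d} transform

    def time_shift(self, x):
        # mix each token with its predecessor on half of the channels
        # (assumed form of the extra time-mixing; the exact split may differ)
        C = x.shape[-1]
        prev = F.pad(x, (0, 0, 1, -1))                     # prev[:, t] = x[:, t-1], zeros at t=0
        return torch.cat([x[..., :C // 2], prev[..., C // 2:]], dim=-1)

    def forward(self, x):                                  # x: (B, T, C)
        x = self.time_shift(x)
        k = self.key(x)
        v = self.value(x)
        r = self.receptance(x)
        # GeGLU core gelu(K) * V, projected by W, gated by the extra sigmoid(R)
        return torch.sigmoid(r) * self.weight(F.gelu(k) * v)
```

Both sketches map (B, T, C) to (B, T, C), so the two layer types can be stacked alternately, presumably with the usual residual connections and normalization around each.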