From 4c6db5607c6f94c38c10004efb292510bc71ba59 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Mon, 9 Aug 2021 14:31:02 +0800
Subject: [PATCH] Update README.md
---
README.md | 41 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index eaff3e9..37e6b35 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,41 @@
# RWKV-LM
-The RWKV Language Model
+
+We propose the RWKV language model, with alternating time-mix and channel-mix layers:
+
+$$\text{Time-mix}: \textbf{TM}_{t,c} = \text{sigmoid}(\textbf{R}_{t,c}) \cdot \sum_{u} \textbf{W}_{t,u,c} \cdot \text{softmax}_u(\textbf{K}_{u,c}) \cdot \textbf{V}_{u,c}$$
+
+$$\text{Channel-mix}: \textbf{CM}_{t,c} = \text{sigmoid}(\textbf{R}_{t,c}) \cdot \sum_{d} \textbf{W}_{c,d} \cdot \text{gelu}(\textbf{K}_{t,d}) \cdot \textbf{V}_{t,d}$$
+
+* Here R, K, V are generated by linear transforms of the input.
+
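+As a concrete reading of the Time-mix formula, here is a minimal PyTorch sketch (bidirectional form with a full learned W table; the causal normalization and the W decomposition discussed below are left out, and all module and variable names are illustrative rather than taken from this repo):
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class TimeMix(nn.Module):
+    # TM_{t,c} = sigmoid(R_{t,c}) * sum_u W_{t,u,c} * softmax_u(K_{u,c}) * V_{u,c}
+    def __init__(self, n_embd, ctx_len):
+        super().__init__()
+        self.receptance = nn.Linear(n_embd, n_embd)   # R
+        self.key = nn.Linear(n_embd, n_embd)          # K
+        self.value = nn.Linear(n_embd, n_embd)        # V
+        # W_{t,u,c}: one learned weight per (position pair, channel), full form
+        self.W = nn.Parameter(torch.zeros(ctx_len, ctx_len, n_embd))
+
+    def forward(self, x):                             # x: (B, T, C)
+        B, T, C = x.shape
+        r = torch.sigmoid(self.receptance(x))         # sigmoid(R) gate
+        k = F.softmax(self.key(x), dim=1)             # softmax over time u, per channel
+        v = self.value(x)
+        out = torch.einsum('tuc,buc->btc', self.W[:T, :T], k * v)  # sum over u
+        return r * out
+```
+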
+* The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103). There are two differences.
+
+(1) We changed the softmax normalization. For masked language models, we define:
+
+$$\text{softmax}_u(\textbf{K}_{u,c}) = \frac{\exp(\textbf{K}_{u,c})}{\sum_{v \le t} \exp(\textbf{K}_{v,c})}$$
+
+(2) We decompose W_{t,u,c} and introduce a multi-head W (here h denotes the head that channel c belongs to):
+
+$$\textbf{W}_{t,u,c} = f_h(t-u) \cdot \alpha_h(u) \cdot \beta_h(t)$$
+
+Moreover, we multiply the final output of the Time-mix layer by γ(t). The α, β, γ factors compensate for the effective context size being smaller when t is small.
+
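+Combining (1) and (2) with the γ(t) output scale, here is a hedged sketch of the causal Time-mix (our reading of the formulas above; f, α, β, γ are stored as free learned tables, exp is left unstabilized for brevity, and all names are ours):
+
+```python
+import torch
+import torch.nn as nn
+
+class TimeMixCausal(nn.Module):
+    # Time-mix with causal softmax, W_{t,u,c} = f_h(t-u) * alpha_h(u) * beta_h(t),
+    # and a gamma(t) scale on the final output.
+    def __init__(self, n_embd, n_head, ctx_len):
+        super().__init__()
+        assert n_embd % n_head == 0
+        self.n_head = n_head
+        self.receptance = nn.Linear(n_embd, n_embd)              # R
+        self.key = nn.Linear(n_embd, n_embd)                     # K
+        self.value = nn.Linear(n_embd, n_embd)                   # V
+        self.f = nn.Parameter(torch.zeros(n_head, ctx_len))      # f_h(t-u), indexed by distance
+        self.alpha = nn.Parameter(torch.ones(n_head, ctx_len))   # alpha_h(u)
+        self.beta = nn.Parameter(torch.ones(n_head, ctx_len))    # beta_h(t)
+        self.gamma = nn.Parameter(torch.ones(ctx_len, 1))        # gamma(t)
+
+    def forward(self, x):                                        # x: (B, T, C)
+        B, T, C = x.shape
+        H, D = self.n_head, C // self.n_head
+        r = torch.sigmoid(self.receptance(x))
+        ek = torch.exp(self.key(x)).view(B, T, H, D)             # exp(K); stabilize in real code
+        v = self.value(x).view(B, T, H, D)
+        idx = torch.arange(T, device=x.device)
+        mask = (idx.view(T, 1) >= idx.view(1, T)).float()        # (t, u): 1 where u <= t
+        dist = (idx.view(T, 1) - idx.view(1, T)).clamp(min=0)    # t - u
+        w = self.f[:, dist] * self.alpha[:, None, :T] * self.beta[:, :T, None]  # (H, t, u)
+        w = w * mask                                             # causal
+        num = torch.einsum('htu,buhd->bthd', w, ek * v)          # sum_{u<=t} W * exp(K) * V
+        den = torch.einsum('tu,buhd->bthd', mask, ek)            # sum_{v<=t} exp(K)
+        out = (num / den).reshape(B, T, C)
+        return r * out * self.gamma[:T]                          # gate by sigmoid(R), scale by gamma(t)
+```
+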
+* The Channel-mix is similar to GeGLU (https://arxiv.org/abs/2002.05202) with an extra R factor.
+
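+A matching sketch of the Channel-mix per the formula above (GeGLU-style gating with the extra sigmoid(R) factor; the hidden width and the names are our choices):
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class ChannelMix(nn.Module):
+    # CM_{t,c} = sigmoid(R_{t,c}) * sum_d W_{c,d} * gelu(K_{t,d}) * V_{t,d}
+    def __init__(self, n_embd, hidden_mult=4):
+        super().__init__()
+        hidden = hidden_mult * n_embd
+        self.receptance = nn.Linear(n_embd, n_embd)  # R
+        self.key = nn.Linear(n_embd, hidden)         # K
+        self.value = nn.Linear(n_embd, hidden)       # V
+        self.weight = nn.Linear(hidden, n_embd)      # W_{c,d}
+
+    def forward(self, x):                            # x: (B, T, C)
+        r = torch.sigmoid(self.receptance(x))
+        kv = F.gelu(self.key(x)) * self.value(x)     # GeGLU-style elementwise product
+        return r * self.weight(kv)
+```
+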
+* Finally, we add extra time-mixing as in minGPT-tuned (https://github.com/BlinkDL/minGPT-tuned).
+
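+One common form of this extra time-mixing is a token-shift that blends each position's features with the previous position's before the mixing layers; a hedged sketch (the exact recipe in minGPT-tuned may differ):
+
+```python
+import torch
+import torch.nn as nn
+
+class TimeShift(nn.Module):
+    def __init__(self):
+        super().__init__()
+        # pads the time axis with one zero row at the top and drops the last row,
+        # so position t sees the features of position t-1
+        self.pad = nn.ZeroPad2d((0, 0, 1, -1))
+
+    def forward(self, x):                            # x: (B, T, C)
+        C = x.size(-1)
+        # first half of channels from the previous token, second half from the current one
+        return torch.cat([self.pad(x)[..., :C // 2], x[..., C // 2:]], dim=-1)
+```
+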
+***
+
+Training loss, RWKV vs MHA+Rotary+GeGLU:
+
+*(figure: training-loss curves)*
+
+(character-level loss on the simplebooks-92 dataset, https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip)