From 34fa2ec81bf51c13c11b3f28721b76e56ff9075b Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Wed, 22 Sep 2021 20:45:45 +0800
Subject: [PATCH] Update README.md

---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index d83c904..6f486bb 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
 
 # Token-shift (time-shift mixing)
 
-The token-shift means explicitly using both (half channel of this token) & (half channel of prev token) to generate all vectors.
+The token-shift explicitly uses (half the channels of this token) & (half the channels of prev token) to generate all vectors (QKV, RWKV, ...).
 
 ```
 self.time_shift = nn.ZeroPad2d((0,0,1,-1))
@@ -44,7 +44,9 @@ self.time_shift = nn.ZeroPad2d((0,0,1,-1))
 x = torch.cat([self.time_shift(x[:, :, :C//2]), x[:, :, C//2:]], dim = -1)
 ```
 
-I found dividing channels by 2 and shift-1 works the best for Chinese LM. You may want to use more shift for English char-level LM. I checked the weights and found you may want to use less mixing in higher layers.
+Dividing channels by 2 and shift-1 works great for char-level English and char-level Chinese LM.
+
+However for BPE-level English LM, it's only effective if your embedding is large enough (at least 1024 - so the usual small L12-D768 model is not enough).
 
 My theory on the effectiveness of token-shift:
 
@@ -56,7 +58,7 @@ When we train a GPT, the hidden representation of a token has to accomplish two
 
 The shifted channels can focus on (2), so we have good propagation of info. It's like some kind of residual connection, or a small RNN inside the transformer.
 
-You can use token-shift in usual QKV self-attention too. I looked at the weights, and found V really likes the shifted channels, less so for Q. Makes sense if you think about it.
+You can use token-shift in usual QKV self-attention too. I looked at the weights, and found V really likes the shifted channels, less so for Q. Makes sense if you think about it. I also found you may want to use less mixing in higher layers.
 
 p.s. There is a MHA_pro model in this repo with strong performance. Give it a try :)
 
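
Below the patch, for context: a minimal, self-contained sketch of the token-shift trick the README hunks above describe, applied to a plain QKV projection (half of each token's channels are taken from the previous token before projecting). The module name `TokenShiftQKV` and the dimensions are illustrative assumptions, not code from the repo; only the two `time_shift` lines mirror the README snippet.

```
import torch
import torch.nn as nn

class TokenShiftQKV(nn.Module):
    """Illustrative sketch: token-shift mixing feeding a Q/K/V projection."""
    def __init__(self, n_embd: int):
        super().__init__()
        # ZeroPad2d((left, right, top, bottom)) on a (B, T, C) tensor pads the
        # last two dims: +1 zero row at the start of T, -1 row cropped at the
        # end, so position t receives the features of position t-1.
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
        self.query = nn.Linear(n_embd, n_embd)
        self.key = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)

    def forward(self, x: torch.Tensor):
        B, T, C = x.shape
        # Token-shift: first half of the channels comes from the previous
        # token (zeros for the first token), second half stays with this token.
        x = torch.cat([self.time_shift(x[:, :, :C // 2]), x[:, :, C // 2:]], dim=-1)
        return self.query(x), self.key(x), self.value(x)

if __name__ == "__main__":
    mix = TokenShiftQKV(n_embd=8)
    q, k, v = mix(torch.randn(2, 5, 8))   # (batch=2, seq=5, channels=8)
    print(q.shape, k.shape, v.shape)      # each torch.Size([2, 5, 8])
```

Because the pad adds one step at the top of the time dimension and crops one at the bottom, the shifted half of the channels always carries the previous token's features forward, which is what lets it act like a small residual/RNN path inside the transformer, as the README argues.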