From 1ea53a2f0331448553cc8c5e738252f830e44936 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Sat, 14 Aug 2021 03:11:08 +0800
Subject: [PATCH] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4f05a9c..fa8be87 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
 
 the time-shift mixing means explicitly using both (half channel of this token) & (half channel of prev token) to generate all vectors. 
 
-i find divide by 2 and shift-1 is the best.  i looked at the weights and found you may want to use less mixing in higher layers.
+i found that divide-by-2 (half the channels) with shift-1 works best for a Chinese LM. you may want to use more shift for an English char-level LM. i looked at the weights and found you may want to use less mixing in higher layers.
 
 here is my theory: