diff --git a/README.md b/README.md
index fa8be87..4e97c47 100644
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
 
 * Finally, we add extra time-shift mixing as in (https://github.com/BlinkDL/minGPT-tuned).
 
-***
+# Token-shift (time-shift mixing)
 
 the time-shift mixing means explicitly using both (half channel of this token) & (half channel of prev token) to generate all vectors.
@@ -50,11 +50,9 @@ the time_shifted channels can focus on (2). so we have good propagation of info.
 
 you can use time_shift in usual QKV self-attention too. when i studied the weights, i found V really likes the time_shifted channels. less so for Q. makes sense if you think abt it.
 
-***
+p.s. There is a MHA_pro model in this repo with strong performance. Give it a try :)
 
-p.s. There is aother MHA_pro model in this repo with strong performance. Give it a try :)
-
-***
+# Sampling method
 
 We also propose a new sampling method (as in src/utils.py):
@@ -64,7 +62,7 @@ We also propose a new sampling method (as in src/utils.py):
 
 (3) Feel free to tune the 0.02 and 2 factor.
 
-***
+# Performance
 
 Character-level loss on simplebooks-92 dataset https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
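The token-shift idea described in the diff above (building each layer's input from half of the current token's channels and half of the previous token's channels) can be sketched as follows. This is an illustrative NumPy version under stated assumptions (a `(T, C)` sequence, a 50/50 channel split, zeros for the first token), not the repo's actual implementation:

```python
import numpy as np

def time_shift_mix(x: np.ndarray) -> np.ndarray:
    """Mix half of each token's channels with the previous token's.

    x: array of shape (T, C) -- a sequence of T token embeddings.
    Returns an array of the same shape in which the first C//2 channels
    of token t are taken from token t-1 (zeros for the first token),
    while the remaining channels stay with token t.
    """
    T, C = x.shape
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]          # token t "sees" token t-1
    half = C // 2
    # first half of channels: shifted; second half: unchanged
    return np.concatenate([prev[:, :half], x[:, half:]], axis=-1)
```

The mixed tensor would then feed the Q/K/V (or R/W/K/V) projections, so every projection gets information from two adjacent tokens for free.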