diff --git a/README.md b/README.md
index 8e5bfd5..ed556aa 100644
--- a/README.md
+++ b/README.md
@@ -38,6 +38,12 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
 
 The token-shift means explicitly using both (half channel of this token) & (half channel of prev token) to generate all vectors.
 
+```
+self.time_shift = nn.ZeroPad2d((0,0,1,-1))
+
+x = torch.cat([self.time_shift(x[:, :, :C//2]), x[:, :, C//2:]], dim = -1)
+```
+
 I found dividing channels by 2 and shift-1 works the best for Chinese LM. You may want to use more shift for English char-level LM.
 
 I checked the weights and found you may want to use less mixing in higher layers.
 
 My theory on the effectiveness of token-shift:
@@ -64,6 +70,18 @@ We also propose a new sampling method called top-a (as in src/utils.py):
 
 (3) Feel free to tune the 0.02 and 2 factor.
 
+The idea of top-a:
+1. If max_prob=0.9, then remove all tokens with prob < 0.0162 (so, removing most alternatives)
+2. If max_prob=0.5, then remove all tokens with prob < 0.0050 (so, allowing more choices)
+3. If max_prob=0.1, then remove all tokens with prob < 0.0002 (so, allowing lots of possibilities)
+
+```
+probs = F.softmax(logits, dim=-1)
+
+limit = torch.pow(torch.max(probs), 2.0) * 0.02
+logits[probs < limit] = -float('Inf')
+```
+
 # Performance
 
 Character-level loss on simplebooks-92 dataset https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
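
For reviewers who want to see what the `nn.ZeroPad2d((0,0,1,-1))` trick in the first hunk actually does, here is a minimal sketch, not part of the diff; the toy shapes `B, T, C` are made up for illustration. It checks that the first half of the channels at position t ends up holding the features of position t-1, with zeros at position 0.

```
# Minimal sketch (not part of the diff): verify that nn.ZeroPad2d((0,0,1,-1))
# shifts the first half of the channels back by one token along the sequence axis.
import torch
import torch.nn as nn

B, T, C = 1, 4, 6                                # toy batch, sequence length, channels
x = torch.arange(B * T * C, dtype=torch.float32).reshape(B, T, C)

time_shift = nn.ZeroPad2d((0, 0, 1, -1))         # pad one zero row at the top of T, crop one at the bottom
shifted_half = time_shift(x[:, :, :C // 2])      # first half of channels, taken from the previous token
mixed = torch.cat([shifted_half, x[:, :, C // 2:]], dim=-1)

assert torch.equal(mixed[:, 1:, :C // 2], x[:, :-1, :C // 2])     # position t sees token t-1
assert torch.equal(mixed[:, 0, :C // 2], torch.zeros(B, C // 2))  # position 0 sees zeros
```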
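
And a self-contained sketch of the top-a snippet from the second hunk, wrapped into a function so the thresholds quoted in the bullets (0.0162, 0.0050, 0.0002) can be reproduced. The name `sample_top_a` and the final multinomial draw are illustrative assumptions, not necessarily how src/utils.py implements it; only the masking rule is taken from the diff.

```
# Minimal sketch (assumptions: the name sample_top_a and the multinomial draw are
# illustrative; only the masking rule comes from the snippet in the diff).
import torch
import torch.nn.functional as F

def sample_top_a(logits, ratio=0.02, power=2.0):
    probs = F.softmax(logits, dim=-1)
    limit = torch.pow(torch.max(probs), power) * ratio   # limit = 0.02 * max_prob^2
    logits = logits.clone()
    logits[probs < limit] = -float('Inf')                # drop tokens below the limit
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)

# Reproduce the thresholds in the bullet list: 0.9 -> 0.0162, 0.5 -> 0.0050, 0.1 -> 0.0002
for max_prob in (0.9, 0.5, 0.1):
    print(max_prob, 0.02 * max_prob ** 2)
```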