Update README.md

PENG Bo 4 years ago committed by GitHub
parent 4c6db5607c
commit 1035a7438e

@@ -14,7 +14,7 @@ alt="\begin{align*}
* The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103). There are two differences.
- (1) We changed the softmax normalization. For masked language models, we define:
+ (1) We changed the normalization (denominator). For masked language models, we define:
$$\text{softmax}_t(\text{K}_{u,c}) = \frac{\exp(\text{K}_{u,c})}{\sum_{v \leq t} \exp(\text{K}_{v,c})}$$
@@ -34,6 +34,16 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
***
We also propose a new sampling method (as in src/utils.py):
(1) Find the max probability p_max after softmax.
(2) Remove all entries whose probability is lower than 0.02 * pow(p_max, 2).
(3) Feel free to tune the 0.02 and 2 factors.
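A minimal sketch of this rule, assuming `probs` holds the post-softmax probabilities over the vocabulary (the function name and NumPy code are illustrative, not necessarily the exact implementation in src/utils.py):

```python
import numpy as np

def sample_with_cutoff(probs, rng=np.random.default_rng()):
    # probs: [vocab_size] probabilities after softmax.
    p_max = probs.max()                           # (1) max probability p_max
    cutoff = 0.02 * p_max ** 2                    # (2) threshold 0.02 * pow(p_max, 2)
    probs = np.where(probs < cutoff, 0.0, probs)  # drop low-probability entries
    probs = probs / probs.sum()                   # renormalize the survivors
    return rng.choice(len(probs), p=probs)        # sample a token id
```

Squaring p_max makes the cutoff stricter when the model is confident (near-greedy sampling) and more permissive when the distribution is flat (more candidates survive).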
***
Training loss, RWKV vs MHA+Rotary+GeGLU:
![RWKV-vs-MHA](RWKV-vs-MHA.png)
