* The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103). There are two differences.

(1) We changed the normalization (denominator). For masked language models, we define:

<img src="https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+%5Ctext%7Bsoftmax%7D_t%28%5Ctext%7BK%7D_%7Bu%2Cc%7D%29+%3D+%5Cfrac%7B%5Cexp%28%5Ctext%7BK%7D_%7Bu%2Cc%7D%29%7D%7B%5Csum_%7Bv+%5Cleq+t%7D%5Cexp%28%5Ctext%7BK%7D_%7Bv%2Cc%7D%29%7D">
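For intuition, here is a rough PyTorch sketch of this time-directional normalization computed with cumulative sums. The (T, C) shapes and the AFT-style weighted sum over V are assumptions for illustration, not the repo's exact code:

```python
import torch

def time_mix_sketch(K, V):
    # K, V: (T, C) tensors for one sequence (time steps, channels).
    # softmax_t(K_{u,c}) = exp(K_{u,c}) / sum_{v <= t} exp(K_{v,c}),
    # i.e. each position t normalizes only over positions v <= t.
    w = torch.exp(K)                  # exp(K_{u,c}); a real implementation would also guard against overflow
    num = torch.cumsum(w * V, dim=0)  # sum_{u <= t} exp(K_{u,c}) * V_{u,c}
    den = torch.cumsum(w, dim=0)      # sum_{v <= t} exp(K_{v,c})  <- the changed denominator
    return num / den                  # causal, per-channel weighted mix of past values
```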
Moreover we multiply the final output of the Time-mix layer by γ(t).

***
We also propose a new sampling method (as in src/utils.py); a short sketch follows the steps below:

(1) Find the max probability p_max after softmax.

(2) Remove all entries whose probability is lower than 0.02 * pow(p_max, 2).

(3) Feel free to tune the 0.02 and 2 factors.
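A minimal sketch of these three steps, assuming the input is already the post-softmax probability vector (the actual implementation is in src/utils.py and may differ in details):

```python
import numpy as np

def sample_probs(probs):
    probs = np.array(probs, dtype=np.float64)     # copy so the caller's array is untouched
    p_max = probs.max()                           # (1) max probability after softmax
    probs[probs < 0.02 * (p_max ** 2)] = 0.0      # (2) drop entries below 0.02 * p_max^2
    probs /= probs.sum()                          # renormalize the surviving entries
    return np.random.choice(len(probs), p=probs)  # (3) tune the 0.02 and the exponent 2 to taste
```

Because the cutoff scales with the square of p_max, it prunes aggressively when the model is confident and keeps almost everything when the distribution is flat.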
***
Training loss, RWKV vs MHA+Rotary+GeGLU:

![RWKV-v2-430M](https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-v2-430M-T1.png)