Update README.md

PENG Bo 4 years ago committed by GitHub
parent 4c6db5607c
commit 1035a7438e

@@ -14,7 +14,7 @@ alt="\begin{align*}
* The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103). There are two differences.
- (1) We changed the softmax normalization. For masked language models, we define:
+ (1) We changed the normalization (denominator). For masked language models, we define:
$$\text{softmax}_t(\text{K}_{u,c}) = \frac{\exp(\text{K}_{u,c})}{\sum_{v \leq t} \exp(\text{K}_{v,c})}$$
@@ -34,6 +34,16 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
***
We also propose a new sampling method (as in src/utils.py):
(1) Find the max probability p_max after softmax.
(2) Remove all entries whose probability is lower than 0.02 * pow(p_max, 2).
(3) Feel free to tune the 0.02 and 2 factors.
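A minimal sketch of this rule, assuming `probs` holds the post-softmax probabilities over the vocabulary (the function name and NumPy code are illustrative, not necessarily the exact implementation in src/utils.py):

```python
import numpy as np

def sample_with_cutoff(probs, rng=np.random.default_rng()):
    # probs: [vocab_size] probabilities after softmax.
    p_max = probs.max()                           # (1) max probability p_max
    cutoff = 0.02 * p_max ** 2                    # (2) threshold 0.02 * pow(p_max, 2)
    probs = np.where(probs < cutoff, 0.0, probs)  # drop low-probability entries
    probs = probs / probs.sum()                   # renormalize the survivors
    return rng.choice(len(probs), p=probs)        # sample a token id
```

Squaring p_max makes the cutoff stricter when the model is confident (near-greedy sampling) and more permissive when the distribution is flat (more candidates survive).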
***
Training loss, RWKV vs MHA+Rotary+GeGLU:
![RWKV-vs-MHA](RWKV-vs-MHA.png)
