From 1301d383bbe30722ffcfac5345318ba8a96ffdf3 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Mon, 27 Jun 2022 00:58:45 +0800
Subject: [PATCH] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index baade69..f57cbfa 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ See the release here for a 27M params model on enwik8 with 0.72 BPC(dev). Run ru
 
 Training: https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v2-RNN
 
-You will be training the "GPT" version because it's paralleziable and faster to train. I find RWKV-2 can extrapolate, so training with ctxLen 768 can work for ctxLen of several thousand. You can fine-tune the model with longer ctxLen later and it can quickly adapt to longer ctxLens.
+You will be training the "GPT" version because it's parallelizable and faster to train. I find RWKV-2 can extrapolate, so training with ctxLen 768 can work for ctxLen of 1000+. You can fine-tune the model with longer ctxLen and it can quickly adapt to longer ctxLens.
 
 **UPDATE: Search for "RWKV v2+" here and change RWKV-2 to PreLN to make it more stable.**
 
@@ -82,7 +82,7 @@ kv / k is the memory mechanism. The token with high k can be remembered for a lo
 ### RWKV v2+ improvements (not yet uploaded to github. used in the latest 1.5B run)
 
 Use different trainable TimeMix factors for R / K / V in SA and FF layers. Example:
-```
+```python
 xx = self.time_shift(x)
 xk = x * self.time_mix_k + xx * (1 - self.time_mix_k)
 xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
@@ -90,7 +90,7 @@ xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)
 ```
 
 Use preLN instead of postLN (more stable & faster convergence):
-```
+```python
 if self.layer_id == 0:
     x = self.ln0(x)
 x = x + self.att(self.ln1(x))
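
For context, below is a minimal PyTorch sketch of the two changes this patch documents: per-channel trainable TimeMix factors for R / K / V, and the preLN residual ordering. The class and variable names (`MiniTimeMix`, `MiniBlock`, `n_embd`) are illustrative assumptions, not the repo's actual modules, and the mixing math beyond the three interpolation lines is a simplified stand-in for RWKV-2's real time-mixing recurrence.

```python
# Illustrative sketch only; not the RWKV-LM source. The three time_mix interpolation
# lines and the preLN block ordering mirror the snippets in the patched README.
import torch
import torch.nn as nn

class MiniTimeMix(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        # time_shift pads one zero step at the front of the time axis,
        # so xx[:, t] below is the previous token's vector (zeros at t = 0)
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
        # separate trainable per-channel mixing factors for R / K / V
        self.time_mix_k = nn.Parameter(torch.full((1, 1, n_embd), 0.5))
        self.time_mix_v = nn.Parameter(torch.full((1, 1, n_embd), 0.5))
        self.time_mix_r = nn.Parameter(torch.full((1, 1, n_embd), 0.5))
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)
        self.receptance = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x):  # x: (batch, time, n_embd)
        xx = self.time_shift(x)
        xk = x * self.time_mix_k + xx * (1 - self.time_mix_k)
        xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
        xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)
        # stand-in for the real RWKV time-mixing: gate the value path by sigmoid(r)
        return torch.sigmoid(self.receptance(xr)) * self.value(xv) * torch.softmax(self.key(xk), dim=-1)

class MiniBlock(nn.Module):
    def __init__(self, n_embd, layer_id):
        super().__init__()
        self.layer_id = layer_id
        self.ln0 = nn.LayerNorm(n_embd)  # extra LN on the raw embedding, used by layer 0 only
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.att = MiniTimeMix(n_embd)
        self.ffn = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        if self.layer_id == 0:
            x = self.ln0(x)
        # preLN: normalize the input of each sub-layer, then add the residual
        x = x + self.att(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

x = torch.randn(2, 8, 16)                         # (batch, time, n_embd)
print(MiniBlock(n_embd=16, layer_id=0)(x).shape)  # torch.Size([2, 8, 16])
```

The preLN ordering normalizes the input of each sub-layer instead of its output, which is what the UPDATE note in this patch refers to when it suggests switching RWKV-2 to PreLN for stability.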