diff --git a/README.md b/README.md
index a41758c..bab0185 100644
--- a/README.md
+++ b/README.md
@@ -39,25 +39,21 @@ Check https://github.com/BlinkDL/RWKV-v2-RNN-Pile for L24-D1024 and L12-D768 mod
 
 Read the inference code in https://github.com/BlinkDL/RWKV-v2-RNN-Pile/blob/main/src/model.py and try using the final hidden state (.xx .aa .bb) as a faithful sentence embedding for other tasks (you should probably begin with .xx and .aa/.bb (.aa divided by .bb)).
 
-See the release here for a 27M params model on enwik8 with 0.72 BPC(dev). Run run.py in https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v2-RNN. You can even run it in your browser: https://github.com/BlinkDL/AI-Writer/tree/main/docs/eng https://blinkdl.github.io/AI-Writer/eng/ (this is using tf.js WASM single-thread mode).
+For RWKV-2: see the release here for a 27M params model on enwik8 with 0.72 BPC(dev). Run run.py in https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v2-RNN. You can even run it in your browser: https://github.com/BlinkDL/AI-Writer/tree/main/docs/eng https://blinkdl.github.io/AI-Writer/eng/ (this is using tf.js WASM single-thread mode).
 
 ### Training / Fine-tuning
 
-Colab for fine-tuning: https://colab.research.google.com/drive/1BwceyZczs5hQr1wefmCREonEWhY-zeST
-
-Training: https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v2-RNN
+Training RWKV-3: https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v3
 
 You will be training the "GPT" version because it's parallelizable and faster to train. I find RWKV can extrapolate, so training with ctxLen 768 can work for ctxLen of 1000+. You can fine-tune the model with longer ctxLen and it can quickly adapt to longer ctxLens.
 
-**UPDATE: Search for "RWKV-3" here (which is using PreLN) to make it more stable.**
-
-Fine-tuning: see https://github.com/BlinkDL/RWKV-v2-RNN-Pile.
+Colab for fine-tuning the Pile models: https://colab.research.google.com/drive/1BwceyZczs5hQr1wefmCREonEWhY-zeST
 
 ## How it works
 
 RWKV is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103).
 
-And it's also using a number of my tricks, such as:
+Moreover, it's using a number of my tricks, such as:
 
 * SmallInitEmb: https://github.com/BlinkDL/SmallInitEmb (applicable to all transformers) which helps the embedding quality, and stabilizes Post-LN (which is what I am using).
 
@@ -87,7 +83,7 @@ kv / k is the memory mechanism. The token with high k can be remembered for a lo
 
 **RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in a usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
 
-## RWKV-3 improvements (not yet uploaded to github. used in the latest 1.5B run)
+## RWKV-3 improvements (used in the latest 1.5B run)
 
 Use different trainable TimeMix factors for R / K / V in SA and FF layers. Example:
 ```python
@@ -111,8 +107,6 @@ Removing the maxK limitation will also make it easy to clean the state of a KV-V
 
 ## Explaining the code for RWKV-3 GPT mode
 
-Note: this is for the latest RWKV-3 model.
-
 ### The GPT mode - overview
 
 The building blocks of RWKV-3 GPT mode are similar to that of a usual preLN GPT.
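
As a side note on the "How it works" paragraph touched above (the W-0.8-channel / W-0.5-channel example), the parallelism claim can be illustrated with a tiny sketch. This is a hypothetical NumPy toy, not code from the repository; the names `kv`, `w`, `T`, `C` and the shapes are assumptions made only for this illustration.

```python
# Toy illustration (assumed names/shapes, not the repo's code): with a
# data-independent per-channel decay w, the recurrent state is a fixed
# exponentially weighted sum of past contributions, so every timestep
# can be computed at once instead of step by step.
import numpy as np

T, C = 8, 4                                  # sequence length, number of channels
rng = np.random.default_rng(0)
kv = rng.standard_normal((T, C))             # per-step contributions (stand-in for kv_t)
w = np.array([0.8, 0.5, 0.9, 0.99])          # per-channel time-decay: trainable but data-independent

# Sequential (RNN) view: state_t = w * state_{t-1} + kv_t
state = np.zeros(C)
seq_states = []
for t in range(T):
    state = w * state + kv[t]
    seq_states.append(state.copy())
seq_states = np.stack(seq_states)

# Parallel view: state_t = sum_{i <= t} w**(t - i) * kv_i,
# which can be formed for all t simultaneously because w never depends on the data.
exponents = np.arange(T)[:, None] - np.arange(T)[None, :]            # (T, T): t - i
decay = np.where(exponents[:, :, None] >= 0,
                 w[None, None, :] ** exponents[:, :, None], 0.0)     # zero out future steps (i > t)
par_states = np.einsum('tic,ic->tc', decay, kv)

assert np.allclose(seq_states, par_states)
```

If `w` were produced by a gate from the current input, as in a usual RNN, the closed-form sum over past steps would no longer exist and the loop would have to run sequentially; keeping `w` data-independent (but trainable) is what allows the parallel formulation.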