From d6ff9a085f3b70e3bf661c809035f90aef293f7b Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Thu, 19 May 2022 17:25:35 +0800
Subject: [PATCH] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 7677451..bc37b2b 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,7 @@ https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v2-RNN
 
 RWKV is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103).
 
-However it's also using a number of my tricks, such as:
+And it's also using a number of my tricks, such as:
 
 * SmallInitEmb: https://github.com/BlinkDL/SmallInitEmb (applicable to all transformers) which helps the embedding quality, and stabilizes Post-LN (which is what I am using).
 
@@ -50,7 +50,7 @@ However it's also using a number of my tricks, such as:
 
 * Extra R-gate in the FFN (applicable to all transformers). I am also using reluSquared from Primer.
 
-* Better initilization: I init most of the matrices to ZERO (see RWKV_Init in https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v2-RNN/src/model.py)
+* Better initilization: I init most of the matrices to ZERO (see RWKV_Init in https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v2-RNN/src/model.py).
 
 * You can transfer some parameters from a small model to a large model, for faster and better convergence (see https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/).
 