From c103f0caa3cb8371a2e77f814ede9505265c588d Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Mon, 27 Jun 2022 10:37:00 +0800
Subject: [PATCH] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f57cbfa..b369909 100644
--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ And it's also using a number of my tricks, such as:
 
 * Better initilization: I init most of the matrices to ZERO (see RWKV_Init in https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v2-RNN/src/model.py).
 
-* You can transfer some parameters from a small model to a large model, for faster and better convergence (see https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/).
+* You can transfer some parameters from a small model to a large model (note: I sort & smooth them too), for faster and better convergence (see https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/).
 
 * My CUDA kernel: https://github.com/BlinkDL/RWKV-CUDA to speedup training.
 
@@ -173,7 +173,7 @@ rwkv = self.output(rwkv) # final output projection
 
 The self.key, self.receptance, self.output matrices are all initialized to zero.
 
-The time_mix, time_decay, time_first vectors are transferred from a smaller trained model.
+The time_mix, time_decay, time_first vectors are transferred from a smaller trained model (note: I sort & smooth them too).
 
 #### The GPT mode - FFN block
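
The patch above adds the note that transferred per-channel vectors (time_mix, time_decay, time_first) are sorted and smoothed, but does not spell out how. A minimal sketch of one plausible reading, assuming "sort" means ordering the small model's channel values and "smooth" means interpolating them onto the larger channel count with a small moving average (the function name `transfer_vector` and the window size are hypothetical, not from the repository):

```python
import numpy as np

def transfer_vector(small_vec, large_size, smooth_window=3):
    """Hypothetical sketch of 'sort & smooth' parameter transfer:
    sort the small model's per-channel vector, stretch the sorted
    values onto the large model's channel grid by linear
    interpolation, then apply a short moving-average filter."""
    small_sorted = np.sort(small_vec)
    # map sorted small-model values onto the larger channel count
    x_small = np.linspace(0.0, 1.0, len(small_sorted))
    x_large = np.linspace(0.0, 1.0, large_size)
    large_vec = np.interp(x_large, x_small, small_sorted)
    # simple moving-average smoothing across channels
    kernel = np.ones(smooth_window) / smooth_window
    return np.convolve(large_vec, kernel, mode="same")
```

Sorting removes channel-order noise from the small run before the values seed the large model; the interpolation keeps the overall value distribution while matching the new width.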