From 234aa8a5bb19cb84da1df40071949db65ae866a9 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Mon, 27 Jun 2022 12:55:56 +0800
Subject: [PATCH] Update README.md

---
 README.md | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index fd38e9d..696ddab 100644
--- a/README.md
+++ b/README.md
@@ -65,7 +65,7 @@ And it's also using a number of my tricks, such as:
 
 * My CUDA kernel: https://github.com/BlinkDL/RWKV-CUDA to speedup training.
 
-### The pseudocode (execution from top to bottom):
+## The pseudocode (execution from top to bottom):
 
 ![RWKV-v2-RNN](RWKV-v2-RNN.png)
 
@@ -79,9 +79,7 @@ kv / k is the memory mechanism. The token with high k can be remembered for a lo
 
 **RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
 
-========================================================================
-
-### RWKV v2+ improvements (not yet uploaded to github. used in the latest 1.5B run)
+## RWKV v2+ improvements (not yet uploaded to github. used in the latest 1.5B run)
 
 Use different trainable TimeMix factors for R / K / V in SA and FF layers. Example:
 ```python
@@ -107,13 +105,11 @@ Namely, this is my plan:
 
 ![RWKV-v3-plan](RWKV-v3-plan.png)
 
-========================================================================
-
-### Explaining the code for RWKV v2+ GPT mode
+## Explaining the code for RWKV v2+ GPT mode
 
 Note: this is for the latest v2+ model.
 
-#### The GPT mode - overview
+### The GPT mode - overview
 
 The building blocks of RWKV-2 GPT mode are similar to that of a usual preLN GPT.
 
@@ -139,7 +135,7 @@ For the first 15B tokens, LR is fixed at 3e-4, and beta=(0.9, 0.99).
 
 Then I set beta=(0.9, 0.999), and do an exponential decay of LR, reaching 1e-5 at 332B tokens.
 
-#### The GPT mode - ATT block
+### The GPT mode - ATT block
 
 The RWKV-2 does not have any attention in the usual sense, but we will call this block ATT anyway.
 ```python
@@ -181,7 +177,7 @@ The self.key, self.receptance, self.output matrices are all initialized to zero.
 
 The time_mix, time_decay, time_first vectors are transferred from a smaller trained model (note: I sort & smooth them too).
 
-#### The GPT mode - FFN block
+### The GPT mode - FFN block
 
 The FFN block has three tricks comparing with the usual GPT:
 
@@ -207,9 +203,7 @@ return rkv
 ```
 The self.value, self.receptance matrices are all initialized to zero.
 
-========================================================================
-
-### From GPT to RWKV-2 (the formulas)
+## From GPT to RWKV-2 (the formulas)
 
 Let F[t] be the system state at t.
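
The hunks above describe the kv / k memory and the per-channel, data-independent time-decay only in prose. Below is a minimal NumPy sketch of that idea, with toy shapes, random inputs, and made-up decay values; it is an illustration of the mechanism, not the kernel used in the repository.

```python
import numpy as np

# Toy illustration of the kv / k memory with a data-independent,
# trainable per-channel decay W (all values here are made up).
T, C = 8, 4                              # sequence length, channels
W = np.array([0.8, 0.5, 0.9, 0.3])       # one decay factor per channel
k = np.exp(np.random.randn(T, C))        # per-token "importance" (positive)
v = np.random.randn(T, C)                # per-token value

num = np.zeros(C)                        # running sum of k * v  ("kv")
den = np.zeros(C)                        # running sum of k      ("k")
out = np.zeros((T, C))
for t in range(T):
    num = W * num + k[t] * v[t]          # decay old memory, add new token
    den = W * den + k[t]
    out[t] = num / (den + 1e-8)          # tokens with large k persist longer

# Because W does not depend on the data, out[] can also be computed for all
# t at once (e.g. as a convolution over time), which is what makes the
# layer parallelizable during training rather than strictly sequential.
```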
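
The training schedule quoted in the overview hunk (LR fixed at 3e-4 for the first 15B tokens, then exponential decay reaching 1e-5 at 332B tokens) can be written as a small helper. This sketch assumes log-linear interpolation between those two points; the exact shape used for the 1.5B run may differ.

```python
import math

LR_START, LR_END = 3e-4, 1e-5            # values quoted in the README
T_START, T_END = 15e9, 332e9             # tokens seen during training

def lr_at(tokens: float) -> float:
    """Learning rate after `tokens` training tokens (sketch, not the actual schedule code)."""
    if tokens <= T_START:
        return LR_START                  # fixed-LR phase
    if tokens >= T_END:
        return LR_END
    frac = (tokens - T_START) / (T_END - T_START)
    # exponential (log-linear) decay from LR_START down to LR_END
    return LR_START * math.exp(frac * math.log(LR_END / LR_START))
```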