diff --git a/README.md b/README.md
index bab0185..20e22b5 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ So it's combining the best of RNN and transformer - **great performance, fast in
 
 Inference speed on single A40 (tf32):
 
-RWKV-2 1.5B = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
+RWKV-3 1.5B = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
 
 GPT2-XL 1.3B = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M
 
@@ -73,7 +73,7 @@ Moreover it's using a number of my tricks, such as:
 
 ![RWKV-v2-RNN](RWKV-v2-RNN.png)
 
-The a b c d factors work together to build a time-decay curve: X, 1, W, W^2, W^3, ...
+The a b c d factors work together to build a time-decay curve: [X, 1, W, W^2, W^3, ...].
 
 Write out the formulas for "token at pos 2" and "token at pos 3" and you will get the idea:
 * a and b: EMAs of kv and k.
@@ -83,6 +83,8 @@ kv / k is the memory mechanism. The token with high k can be remembered for a lo
 
 **RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
 
+The R-gate is important for performance. k = info strength of this token (to be passed to future tokens). r = whether to apply the info to this token.
+
 ## RWKV-3 improvements (used in the latest 1.5B run)
 
 Use different trainable TimeMix factors for R / K / V in SA and FF layers. Example:
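
For readers tracing the a / b / c / d description in the second and third hunks, below is a minimal sketch of the per-channel recurrence as I read it. It is an illustration, not code from this repo: the names `time_mix_step`, `W` (per-channel time-decay), and `X` (extra weight on the current token) are my own, and the R-gate is modeled as a sigmoid on r, matching the "whether to apply the info to this token" description in the added line.

```python
import torch

# Minimal sketch (not the repo's code) of the a/b/c/d recurrence for one layer,
# treating W (per-channel time-decay) and X (extra weight on the current token)
# as given trainable parameters. All tensors have shape [n_channels].
def time_mix_step(a, b, k, v, r, W, X):
    ek = torch.exp(k)
    # c and d: the EMAs combined with the current token's "self-attention" term,
    # giving the decay curve [X, 1, W, W^2, W^3, ...] over current and past tokens
    c = X * ek * v + a
    d = X * ek + b
    # kv / k is the memory mechanism; the R-gate decides whether to apply it here
    out = torch.sigmoid(r) * (c / d)
    # a and b: EMAs of kv and k, decayed by W before the next token arrives
    a = W * a + ek * v
    b = W * b + ek
    return out, a, b


# Usage: start from zero state and feed tokens one at a time (RNN mode).
n = 8
a, b = torch.zeros(n), torch.zeros(n)
W = torch.full((n,), 0.9)  # per-channel decay in (0, 1)
X = torch.full((n,), 1.5)  # weight of the current token
for _ in range(5):
    k, v, r = torch.randn(n), torch.randn(n), torch.randn(n)
    out, a, b = time_mix_step(a, b, k, v, r, W, X)
```

Because W is data-independent, the weights [X, 1, W, W^2, ...] that this loop produces over past tokens form a fixed per-channel kernel that can be precomputed, which is what makes the training-mode formulation parallelizable.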