Update README.md

PENG Bo authored 4 years ago, committed by GitHub
parent 6f7240e693
commit b54b204074

@@ -95,7 +95,7 @@ xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)
```
-Use preLN instead of postLN:
+Use preLN instead of postLN (more stable & faster convergence):
```
if self.layer_id == 0:
    x = self.ln0(x)
@@ -103,6 +103,10 @@ x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))
```
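For context, here is a generic sketch of the two placements (the `Block` class, the injected `att`/`ffn` modules, and the `pre_ln` flag are illustrative, not the repo's actual code): preLN normalizes the input of each sub-layer and leaves the residual stream untouched, whereas postLN normalizes the sum itself, which is why preLN trains more stably.
```
import torch.nn as nn

class Block(nn.Module):
    # Illustrative block, not the repo's module: `att` and `ffn` can be any sub-layers.
    def __init__(self, d_model, att, ffn, pre_ln=True):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.att, self.ffn, self.pre_ln = att, ffn, pre_ln

    def forward(self, x):
        if self.pre_ln:
            # preLN: normalize the sub-layer *input*; the residual path stays an identity,
            # so gradients are well-scaled from the start and convergence is faster.
            x = x + self.att(self.ln1(x))
            x = x + self.ffn(self.ln2(x))
        else:
            # postLN (original Transformer): normalize *after* the residual addition.
            x = self.ln1(x + self.att(x))
            x = self.ln2(x + self.ffn(x))
        return x
```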
I need a better CUDA kernel to (1) remove the maxK limitation so there's no need to clamp k to 60, (2) fix the divide-by-zero without using K_EPS, and (3) support bf16/fp16. Please let me know if you are a CUDA expert :)
Removing the maxK limitation will also make it easy to clear the state of a KV-V channel by using a huge K.
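For anyone looking at (1) and (2): one standard way to avoid both the clamp on k and K_EPS is log-sum-exp rescaling, i.e. subtract the largest exponent before calling exp(), so nothing can overflow fp32 and the denominator always contains a term equal to 1. Below is a minimal Python reference of that idea for a weighted average of the form sum_i exp(w[t-i]+k[i])*v[i] / sum_i exp(w[t-i]+k[i]); the function name and the exact weighting are illustrative, not the repo's kernel.
```
import numpy as np

def wkv_reference(w, k, v):
    # O(T^2) reference: out[t] = sum_i exp(w[t-i]+k[i]) * v[i] / sum_i exp(w[t-i]+k[i])
    # Subtracting the max exponent keeps every exp() argument <= 0, so fp32 never
    # overflows (no clamp on k) and the denominator is >= 1 (no K_EPS needed).
    T = len(k)
    out = np.empty(T)
    for t in range(T):
        e = np.array([w[t - i] + k[i] for i in range(t + 1)])  # raw exponents
        p = e.max()                                            # largest exponent in the window
        s = np.exp(e - p)                                      # every term in (0, 1], max term == 1
        out[t] = (s * np.asarray(v)[: t + 1]).sum() / s.sum()  # safe division, denominator >= 1
    return out
```
A state-carrying kernel can use the same trick by keeping the running maximum alongside the numerator/denominator and rescaling both whenever the maximum grows; with that scheme a very large k simply dominates the maximum and wipes out the old contributions, which gives the "clear the channel with a huge K" behavior for free.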
### From GPT to RWKV-2 (the formulas)
Let F[t] be the system state at t.
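(A hedged sketch of the contrast this section develops; `f`, `g` and `x[t]` are illustrative symbols, not quoted from the original.) In GPT, predicting the next state requires looking at the entire history, so generating a length-T sequence costs O(T^2); an RNN-style model folds the history into a fixed-size state and only needs O(T):
```
GPT:  F[t+1] = f( F[0], F[1], ..., F[t] )   # depends on the whole history -> O(T^2)
RNN:  F[t+1] = g( F[t], x[t] )              # depends only on the current state -> O(T)
```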
