From b54b204074079914a1c2d90922ef1e3777095854 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Mon, 20 Jun 2022 17:52:50 +0800
Subject: [PATCH] Update README.md

---
 README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4185259..f645550 100644
--- a/README.md
+++ b/README.md
@@ -95,7 +95,7 @@
 xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
 xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)
 ```
-Use preLN instead of postLN:
+Use preLN instead of postLN (more stable & faster convergence):
 ```
 if self.layer_id == 0:
 x = self.ln0(x)
@@ -103,6 +103,10 @@
 x = x + self.att(self.ln1(x))
 x = x + self.ffn(self.ln2(x))
 ```
+I need a better CUDA kernel to (1) remove maxK so there is no need to clamp k to 60, (2) fix the divide-by-zero without using K_EPS, and (3) support bf16/fp16. Please let me know if you are a CUDA expert :)
+
+Removing the maxK limitation will also make it easy to clear the state of a KV-V channel, by using a huge K.
+
 ### From GPT to RWKV-2 (the formulas)
 
 Let F[t] be the system state at t.
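
The clamp-at-60 and K_EPS workarounds mentioned in the patch are both symptoms of evaluating `sum(exp(k) * v) / sum(exp(k))` directly, where `exp(k)` can overflow or the denominator can underflow to zero. A minimal sketch of the standard running-max rebasing that removes both (plain Python for clarity, not the actual RWKV CUDA kernel; `wkv_stable` is a hypothetical name, and per-channel decay is omitted to keep the idea visible):

```python
import math

def wkv_stable(k, v):
    # Computes out[t] = sum_{i<=t} exp(k[i]) * v[i] / sum_{i<=t} exp(k[i])
    # as a left-to-right scan. The state (a, b) is stored scaled by
    # exp(-m), where m is the running max of k, so no term ever
    # overflows: no clamp on k and no K_EPS guard needed. In the
    # shifted frame the max element contributes exp(0) = 1, so b >= 1.
    m = -math.inf
    a = 0.0   # numerator,   scaled by exp(-m)
    b = 0.0   # denominator, scaled by exp(-m)
    out = []
    for kt, vt in zip(k, v):
        m_new = max(m, kt)
        scale = math.exp(m - m_new)   # re-base the old state to the new max
        w = math.exp(kt - m_new)      # current term; always <= 1
        a = a * scale + w * vt
        b = b * scale + w
        m = m_new
        out.append(a / b)
    return out
```

Since every stored exponential lies in [0, 1], the same trick also helps the bf16/fp16 goal: the state never leaves half-precision range. A CUDA version would keep `(a, b, m)` per channel in registers and update them the same way.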