I need a better CUDA kernel (https://github.com/BlinkDL/RWKV-CUDA) to (1) remove maxK, so there's no need to clamp k to 60; (2) fix the divide-by-zero without using K_EPS; (3) support bf16/fp16. **Please let me know if you are a CUDA expert :)**
Removing the maxK limitation will also make it easy to clear the state of a KV-V channel by using a huge K.
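In the meantime, one standard way to get (1) and (2) at once is the log-sum-exp trick: keep a running per-channel maximum of the exponents and accumulate the numerator and denominator in that shifted domain. Below is a minimal non-CUDA sketch of the idea against the recurrent form of the time-mix; the function name, the shapes, and the omission of the time_first bonus are my assumptions, not the repo's API:

```
import torch

def wkv_stable(w, k, v):
    # Reference recurrence for:
    #   out[t] = sum_{s<=t} exp(w*(t-s) + k[s]) * v[s]  /  same sum without v
    # w: per-channel decay (negative, e.g. -exp(time_decay)); k, v: (T, C).
    # The running max m keeps every exponent <= 0, so exp() never overflows
    # and k needs no clamp to 60.
    T, C = k.shape
    num = torch.zeros(C)
    den = torch.zeros(C)
    m = torch.full((C,), float('-inf'))   # running max exponent per channel
    out = torch.empty(T, C)
    for t in range(T):
        mt = torch.maximum(m + w, k[t])   # new shift for this step
        e1 = torch.exp(m + w - mt)        # rescales the old state
        e2 = torch.exp(k[t] - mt)        # current token, always <= 1
        num = e1 * num + e2 * v[t]
        den = e1 * den + e2               # one of e1, e2 is exactly 1,
        m = mt                            # so den >= 1: no K_EPS needed
        out[t] = num / den
    return out
```

Because every stored exponent stays <= 0 and the accumulators can be kept in fp32, the same scheme is also a natural starting point for (3), a bf16/fp16 kernel.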
## Explaining the code for RWKV-3 GPT mode
### The GPT mode - overview
```
rkv = torch.sigmoid(self.receptance(xr)) * kv
return rkv
```
The self.value and self.receptance matrices are both initialized to zero.
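As a concrete illustration of why the zero init matters, a minimal sketch (layer names and sizes are illustrative, not the repo's config): with self.value zeroed, kv is all zeros, so the block returns zero and each residual branch starts out as an identity map.

```
import torch.nn as nn

n_embd, hidden_sz = 512, 2048  # illustrative sizes, not the repo's config

value = nn.Linear(hidden_sz, n_embd, bias=False)
receptance = nn.Linear(n_embd, n_embd, bias=False)

# With value.weight == 0, kv = value(k) is all zeros, so
# rkv = sigmoid(r) * kv == 0 and the residual connection passes x
# through unchanged at the start of training.
nn.init.zeros_(value.weight)
nn.init.zeros_(receptance.weight)
```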