|
|
|
@@ -101,6 +101,10 @@ I need a better CUDA kernel to (1) pull off maxK so there's no need to clamp k to 6
|
|
|
|
|
|
|
|
|
|
|
|
Removing the maxK limitation will also make it easy to clean the state of a KV-V channel, by using a huge K.
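The reset trick follows from the form of the recurrence: since each token contributes with weight exp(k), a token with a very large k dominates the accumulated state, effectively erasing the history in that channel. A minimal sketch (a simplified RWKV-style recurrence, not the actual CUDA kernel; the decay `w` and the toy inputs are illustrative assumptions):

```python
import math

def wkv_step(a, b, k, v, w):
    # Simplified RWKV-style channel recurrence (illustrative sketch):
    # a accumulates exp(k)-weighted values, b the exp(k) weights,
    # both decayed by exp(-w) per step; the channel output is a / b.
    decay = math.exp(-w)
    a = decay * a + math.exp(k) * v
    b = decay * b + math.exp(k)
    return a, b

# Build up some history in one channel.
a, b = 0.0, 0.0
for k, v in [(0.1, 1.0), (0.2, 2.0), (0.0, 3.0)]:
    a, b = wkv_step(a, b, k, v, w=0.5)

# Feed a "reset" token with a huge k: exp(k) dwarfs the decayed history,
# so the channel's weighted average collapses to this token's v.
a, b = wkv_step(a, b, k=50.0, v=7.0, w=0.5)
print(a / b)  # ~7.0: previous state is effectively wiped
```

With k clamped (as the current kernel requires), exp(k) is bounded and old state can never be fully drowned out; removing the clamp makes this one-token reset possible.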
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Namely, this is what I plan to do:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|
|
|
|
|
========================================================================
|
|
|
|
========================================================================
|
|
|
|
|
|
|
|
|
|
|
|
### Explaining the code for RWKV v2+ GPT mode
|
|
|
|
|
|
|
|
|