$$\text{out}_t \;=\; \text{sigmoid}(R_t)\cdot\sum_{u} W_{u,t}\cdot K_u\cdot V_u$$
* Here R, K, V are generated by linear transforms of the input, and W is a learned parameter. Basically, RWKV decomposes attention into R(target) * W(src, target) * K(src). So we can call R "receptance", and the sigmoid means it is in the 0~1 range.
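
To make the decomposition concrete, here is a minimal PyTorch sketch of the idea. It assumes a batch-first (B, T, C) input, a causal lower-triangular mask, and elementwise per-channel mixing; the class and parameter names (`RWKVMixSketch`, `time_w`) and the toy initialization are illustrative assumptions, not the actual layer in this repo.

```python
import torch
import torch.nn as nn

class RWKVMixSketch(nn.Module):
    # out(t) = sigmoid(R_t) * sum_u W[u, t] * K_u * V_u, elementwise per channel
    def __init__(self, ctx_len: int, n_embd: int):
        super().__init__()
        # R, K, V are linear transforms of the input
        self.receptance = nn.Linear(n_embd, n_embd, bias=False)
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)
        # W is a learned weight for every (src, target) pair (toy init)
        self.time_w = nn.Parameter(torch.randn(ctx_len, ctx_len) / ctx_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        r = torch.sigmoid(self.receptance(x))     # R(target), squashed into 0~1
        kv = self.key(x) * self.value(x)          # K(src) * V(src), elementwise
        w = torch.tril(self.time_w[:T, :T])       # causal: a target only mixes src <= target
        mix = torch.einsum('ts,bsc->btc', w, kv)  # sum over src for each target
        return r * mix                            # gate the mixed values with receptance

layer = RWKVMixSketch(ctx_len=128, n_embd=64)
y = layer(torch.randn(2, 16, 64))                 # -> shape (2, 16, 64)
```

Because sigmoid(R) stays in 0~1, it acts as a per-channel gate on the mixed values, which is why R is called "receptance": it controls how much of the mixture each target position accepts.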