Update README.md

main
PENG Bo 4 years ago committed by GitHub
parent cd9b352b45
commit 6e2ba61d95

@@ -48,6 +48,8 @@ when you train a GPT, the hidden representation of a token has to accomplish two
the time_shifted channels can focus on (2). so we have good propagation of info. it's like some kind of residual connection.
you can use time_shift in usual QKV self-attention too. when i studied the weights, i found V really likes time_shift. less so for Q. makes sense if you think about it.
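a minimal sketch of the time_shift idea, standalone in numpy for illustration (the repo itself does this in PyTorch, roughly via nn.ZeroPad2d((0, 0, 1, -1)) on the hidden states; the function name and 50/50 channel split here are assumptions for the example):

```python
import numpy as np

def time_shift(x, shift_frac=0.5):
    # x: (T, C) hidden states for a sequence of T tokens.
    # A fraction of the channels is shifted one step back in time
    # (zero-padded at t=0), so those channels carry the PREVIOUS
    # token's representation, while the rest keep the current token's.
    T, C = x.shape
    k = int(C * shift_frac)       # number of time-shifted channels (assumed 50%)
    out = np.zeros_like(x)
    out[1:, :k] = x[:-1, :k]      # channels [0:k] now see token t-1
    out[:, k:] = x[:, k:]         # channels [k:] unchanged
    return out

x = np.arange(12, dtype=float).reshape(3, 4)  # T=3 tokens, C=4 channels
y = time_shift(x)
```

the shifted channels give each position direct access to the previous token's hidden state, which is the cheap residual-style info propagation described above.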
***
p.s. There is another MHA_pro model in this repo with strong performance. Give it a try :)
