From 6e2ba61d95b2d8e0ee28f40a849bdc868d821226 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Fri, 13 Aug 2021 13:54:24 +0800
Subject: [PATCH] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index b2fba21..089584a 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,8 @@ when you train a GPT, the hidden representation of a token has to accomplish two
 
 the time_shifted channels can focus on (2). so we have good propagation of info. it's like some kind of residual connection.
 
+you can use time_shift in usual QKV self-attention too. when I studied the weights, I found V really likes time_shift, and less so for Q. it makes sense if you think about it.
+
 ***
 
 p.s. There is another MHA_pro model in this repo with strong performance. Give it a try :)
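
For context, here is a minimal PyTorch sketch of what "time_shift in usual QKV self-attention" can look like. It is an illustration, not the repo's exact code: it assumes time_shift means replacing the first half of each token's channels with the previous token's channels (zeros for the first token), and the names `time_shift` and `TimeShiftSelfAttention` are made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_shift(x):
    # x: (batch, seq_len, channels)
    # Assumed definition: the first half of each token's channels is taken
    # from the previous token (zeros for token 0); the rest stays as-is.
    B, T, C = x.shape
    shifted = F.pad(x, (0, 0, 1, -1))  # shift the time dimension right by one step
    return torch.cat([shifted[..., :C // 2], x[..., C // 2:]], dim=-1)

class TimeShiftSelfAttention(nn.Module):
    """Plain causal QKV self-attention with time_shift applied to the V input
    (and optionally K), while Q is left untouched - following the observation
    in the patch that V benefits from time_shift the most."""
    def __init__(self, n_embd, n_head, shift_k=False):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.shift_k = shift_k
        self.q_proj = nn.Linear(n_embd, n_embd)
        self.k_proj = nn.Linear(n_embd, n_embd)
        self.v_proj = nn.Linear(n_embd, n_embd)
        self.out_proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.shape
        q = self.q_proj(x)                                    # no shift for Q
        k = self.k_proj(time_shift(x) if self.shift_k else x)
        v = self.v_proj(time_shift(x))                        # V likes time_shift
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (C // self.n_head) ** 0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(mask, float('-inf')).softmax(dim=-1)  # causal softmax
        y = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.out_proj(y)

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                    # (batch, seq_len, n_embd)
    attn = TimeShiftSelfAttention(n_embd=64, n_head=4)
    print(attn(x).shape)                          # torch.Size([2, 16, 64])
```

Shifting only the V (and optionally K) input keeps the queries of the current token intact while letting the values carry information from the previous position, which matches the weight observation described in the added README lines.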