From c68ea168b16118061b8fc06104a87d9e3648e81b Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Fri, 13 Aug 2021 14:21:50 +0800
Subject: [PATCH] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 18ed9f3..7d061f4 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
 
 * The Channel-mix is similar to GeGLU (https://arxiv.org/abs/2002.05202) with an extra R factor.
 
-* Finally, we add extra time-shift mixing as in (https://github.com/BlinkDL/minGPT-tuned). You can try reducing the amt of time-mixing in upper layers of deep models.
+* Finally, we add extra time-shift mixing as in (https://github.com/BlinkDL/minGPT-tuned).
 
 ***
 
@@ -48,7 +48,7 @@ when you train a GPT, the hidden representation of a token has to accomplish two
 
 the time_shifted channels can focus on (2). so we have good propagation of info. it's like some kind of residual connection.
 
-you can use time_shift in usual QKV self-attention too. when i studied the weights, i found V really likes the time_shifted channel. less so for Q. makes sense if you think abt it.
+you can use time_shift in usual QKV self-attention too. when i studied the weights, i found V really likes the time_shifted channels. less so for Q. makes sense if you think abt it.
 
 ***
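
For readers who want to see what the time-shift mixing discussed in the patched README lines looks like in code, below is a minimal PyTorch sketch. It is not taken from this repository or from minGPT-tuned: the module name `TimeShiftMix`, the half-and-half channel split, and the use of `nn.ZeroPad2d` for the one-step shift are illustrative assumptions about how "replace some channels with the previous token's channels" could be implemented.

```python
import torch
import torch.nn as nn


class TimeShiftMix(nn.Module):
    """Illustrative sketch (not the repo's code): replace part of each token's
    channels with the same channels taken from the previous token."""

    def __init__(self, n_embd: int, shift_ratio: float = 0.5):
        super().__init__()
        self.n_shift = int(n_embd * shift_ratio)  # how many channels come from t-1
        # Pads one step at the start of the time axis and crops one at the end,
        # so position t ends up holding the values from position t-1 (zeros at t=0).
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        x_prev = self.time_shift(x)
        # First channels stay current; the last `n_shift` channels come from the previous token.
        return torch.cat([x[..., :-self.n_shift], x_prev[..., -self.n_shift:]], dim=-1)


if __name__ == "__main__":
    x = torch.randn(2, 5, 8)   # (batch, time, channels)
    y = TimeShiftMix(8)(x)
    print(y.shape)             # torch.Size([2, 5, 8])
```

The same mixed tensor can be fed into the Q/K/V projections of an ordinary self-attention block, which is how the "time_shift in usual QKV self-attention" remark in the patched text could be tried.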