From 78579a00d246ac88ec8d5c485ae494984bc7099c Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Sat, 18 Feb 2023 01:47:02 +0800
Subject: [PATCH] Update README.md

---
 README.md | 64 ++++++++++++++++++++++++++++--------------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/README.md b/README.md
index dbb27a7..e0e1e28 100644
--- a/README.md
+++ b/README.md
@@ -65,38 +65,6 @@ You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai
 
 ![RWKV-demo](RWKV-demo.png)
 
-## New ideas (just to record some new ideas)
-
-I have an idea to improve tokenization. We can hardcode some channels to have meanings. Example:
-
-Channel 0 = "space"
-
-Channel 1 = "capitalize first letter"
-
-Channel 2 = "capitalize all letters"
-
-Therefore:
-
-Embedding of "abc": [0, 0, 0, x0, x1, x2 , ..]
-
-Embedding of " abc": [1, 0, 0, x0, x1, x2, ..]
-
-Embedding of " Abc": [1, 1, 0, x0, x1, x2, ..]
-
-Embedding of "ABC": [0, 0, 1, x0, x1, x2, ...]
-
-......
-
-so they will share most of the embedding. And we can rapidly compute the output probability of all variations of "abc".
-
-Note: the above method is assuming that p(" xyz") / p("xyz") is the same for any "xyz", which can be wrong.
-
-Better: define emb_space emb_capitalize_first emb_capitalize_all to be a function of emb.
-
-Maybe the Best: let 'abc' ' abc' etc. to share the last 90% of their embeddings.
-
-At this moment, all our tokenizers spend too many items to represent all variations of 'abc' ' abc' ' Abc' etc. Moreover the model cannot discover that these are actually similar if some of these variations are rare in the dataset. My method can solve this. I plan to test this in a new version of RWKV.
-
 ## Quick start
 
 Use https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4neo (latest code, compatible with v4).
@@ -206,6 +174,38 @@ ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
 out.write(ss + "\n")
 ```
 
+## New ideas (just to record some new ideas)
+
+I have an idea to improve tokenization. We can hardcode some embedding channels to have fixed meanings. Example:
+
+Channel 0 = "space"
+
+Channel 1 = "capitalize first letter"
+
+Channel 2 = "capitalize all letters"
+
+Therefore:
+
+Embedding of "abc": [0, 0, 0, x0, x1, x2, ...]
+
+Embedding of " abc": [1, 0, 0, x0, x1, x2, ...]
+
+Embedding of " Abc": [1, 1, 0, x0, x1, x2, ...]
+
+Embedding of "ABC": [0, 0, 1, x0, x1, x2, ...]
+
+......
+
+So all of these variants share most of their embedding, and we can rapidly compute the output probabilities of all variations of "abc".
+
+Note: the above method assumes that p(" xyz") / p("xyz") is the same for any "xyz", which can be wrong.
+
+Better: define emb_space, emb_capitalize_first, and emb_capitalize_all to be functions of emb.
+
+Maybe the best: let 'abc', ' abc', etc. share the last 90% of their embeddings.
+
+At this moment, all our tokenizers spend too many vocabulary items representing all variations of 'abc', ' abc', ' Abc', etc. Moreover, the model cannot discover that these variations are actually similar if some of them are rare in the dataset. My method can solve this. I plan to test it in a new version of RWKV.
+
 ## How it works
 
 RWKV is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103).
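
A minimal sketch of the hardcoded-channel idea from the "New ideas" section above, assuming a PyTorch-style embedding. It is not part of the patch or of RWKV's code; the class name `VariantEmbedding`, the sizes `N_BASE` and `D_MODEL`, and the token id `123` are illustrative placeholders. The first three channels carry the space/capitalization flags, the remaining channels are shared by all variants of a base piece, and a tied output head can then score every variant of a piece almost for free:

```python
import torch
import torch.nn as nn

N_FLAGS = 3        # channel 0: leading space, 1: capitalize first letter, 2: all caps
N_BASE  = 50000    # number of base word pieces ("abc", "xyz", ...); illustrative
D_MODEL = 768      # illustrative model width

class VariantEmbedding(nn.Module):
    """Embedding whose first 3 channels are hardcoded space/capitalization flags;
    the remaining channels are a learned vector shared by all variants of a base piece."""
    def __init__(self):
        super().__init__()
        self.base = nn.Embedding(N_BASE, D_MODEL - N_FLAGS)

    def forward(self, base_id, flags):
        # base_id: LongTensor (...,); flags: FloatTensor (..., 3) of 0/1 indicators
        shared = self.base(base_id)                 # (..., D_MODEL - 3)
        return torch.cat([flags, shared], dim=-1)   # (..., D_MODEL)

emb = VariantEmbedding()
abc = torch.tensor([123])                           # hypothetical id of the base piece "abc"
e_abc = emb(abc, torch.tensor([[0., 0., 0.]]))      # embedding of "abc"
e_Abc = emb(abc, torch.tensor([[1., 1., 0.]]))      # embedding of " Abc"

# With a tied output head, logit(variant) = flags @ h[:3] + base_vector @ h[3:],
# so the expensive shared term is computed once per base piece and each
# space/capitalization variant only adds a 3-dimensional correction.
h = torch.randn(D_MODEL)                            # hidden state at one position
shared_logit = h[N_FLAGS:] @ emb.base.weight[123]   # computed once for "abc"
all_flags = torch.tensor([[0., 0., 0.],             # "abc"
                          [1., 0., 0.],             # " abc"
                          [1., 1., 0.],             # " Abc"
                          [0., 0., 1.]])            # "ABC"
variant_logits = shared_logit + all_flags @ h[:N_FLAGS]
print(variant_logits)                               # four logits, one per variant
```

The "Better" variant mentioned in the section would replace the raw 0/1 flag channels with learned functions of the shared embedding (e.g. emb_space(emb)), which drops the assumption that p(" xyz") / p("xyz") is the same for every "xyz".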