From aeae6c8aacb7133cffd34b419f5bdb115f24342e Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Tue, 31 Jan 2023 11:31:41 +0800
Subject: [PATCH] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e0a188e..25d0c52 100644
--- a/README.md
+++ b/README.md
@@ -67,7 +67,7 @@ You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai
 
 ![RWKV-demo](RWKV-demo.png)
 
-## New ideas (just to record all of my new ideas)
+## New ideas (just to record some new ideas)
 
 I have an idea to improve tokenization. We can hardcode some channels to have meanings. Example:
 
@@ -91,6 +91,8 @@ Embedding of "ABC": [0, 0, 1, x0, x1, x2, ...]
 
 so they will share most of the embedding. And we can rapidly compute the output probability of all variations of "abc".
 
+Note: the above method assumes that p(" xyz") / p("xyz") is the same for any "xyz", which can be wrong. A better method is to define emb_space, emb_capitalize_first, and emb_capitalize_all as functions of emb.
+
 I plan to test this in a new version of RWKV.
 
 ## Quick start
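
A minimal sketch of the idea in the added note, assuming one reading of it: learn emb_space, emb_capitalize_first, and emb_capitalize_all as functions of the base embedding emb, so that "abc", " abc", "Abc", and "ABC" share one base vector plus a learned offset. This is not code from the patch or from RWKV; the module name, the linear maps, and the 0/1 variant flags are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VariantEmbedding(nn.Module):
    """One base vector per lowercase token, plus learned offsets that are
    functions of that vector (here: linear maps, an assumption)."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.emb_space = nn.Linear(dim, dim, bias=False)
        self.emb_capitalize_first = nn.Linear(dim, dim, bias=False)
        self.emb_capitalize_all = nn.Linear(dim, dim, bias=False)

    def forward(self, idx, has_space, cap_first, cap_all):
        # idx: ids of the lowercase token form, shape (B, T)
        # has_space / cap_first / cap_all: 0/1 flags, shape (B, T)
        e = self.emb(idx)
        e = e + has_space.float().unsqueeze(-1) * self.emb_space(e)
        e = e + cap_first.float().unsqueeze(-1) * self.emb_capitalize_first(e)
        e = e + cap_all.float().unsqueeze(-1) * self.emb_capitalize_all(e)
        return e
```

Under this reading, all variants of "abc" still share one base lookup, but the ratio p(" xyz") / p("xyz") is no longer forced to be the same for every "xyz", since each offset depends on emb.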