**You can run RWKV on low VRAM GPUs with this pip package:** https://github.com/harrisonvanderbyl/rwkvstic
So it combines the best of RNN and transformer - **great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embedding**.
You are welcome to join the RWKV discord https://discord.gg/bDSBUMeFpc to build upon it. We have plenty of potential compute (A100 40Gs) now (thanks to Stability and EleutherAI), so if you have interesting ideas I can run them.
I am training RWKV-4 14B on the Pile (final release around Feb-15-2023): https://wandb.ai/blinkdl/RWKV-v4-Pile
**Twitter**: https://twitter.com/BlinkDL_AI

You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai

## New ideas (just to record some new ideas)
I have an idea to improve tokenization: hardcode some embedding channels to have fixed meanings. Example:
* Channel 0 = "space"
* Channel 1 = "capitalize first letter"
* Channel 2 = "capitalize all letters"

Therefore:
* Embedding of "abc": [0, 0, 0, x0, x1, x2, ...]
* Embedding of " abc": [1, 0, 0, x0, x1, x2, ...]
* Embedding of " Abc": [1, 1, 0, x0, x1, x2, ...]
* Embedding of "ABC": [0, 0, 1, x0, x1, x2, ...]
* ......

so all of these variations share most of their embedding, and we can rapidly compute the output probability of every variation of "abc" (a minimal sketch follows below).

Note: the above method assumes that p(" xyz") / p("xyz") is the same for any "xyz", which may be wrong.

Better: define emb_space, emb_capitalize_first, and emb_capitalize_all as functions of emb.

Maybe best: let 'abc', ' abc', etc. share the last 90% of their embeddings.

At the moment, all our tokenizers spend too many entries representing all the variations of 'abc', ' abc', ' Abc', etc. Moreover, the model cannot discover that these are actually similar if some of the variations are rare in the dataset. This method solves that, and I plan to test it in a new version of RWKV.
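Below is a minimal PyTorch sketch of this idea. The channel sizes, the `base_vocab` list, and the `make_embedding` helper are illustrative assumptions, not the actual RWKV tokenizer or embedding code.

```python
import torch

D_FLAGS, D_SHARED = 3, 125                    # 3 hardcoded flag channels + shared channels (sizes are illustrative)
base_vocab = ["abc", "xyz"]                   # hypothetical base tokens (lowercase, no leading space)
base_emb = torch.nn.Embedding(len(base_vocab), D_SHARED)

def make_embedding(token_id: int, space: bool, cap_first: bool, cap_all: bool) -> torch.Tensor:
    """Full embedding = [space flag, cap-first flag, cap-all flag, shared channels...]."""
    flags = torch.tensor([float(space), float(cap_first), float(cap_all)])
    return torch.cat([flags, base_emb.weight[token_id]])

emb_abc  = make_embedding(0, space=False, cap_first=False, cap_all=False)   # "abc"
emb_sabc = make_embedding(0, space=True,  cap_first=False, cap_all=False)   # " abc"
emb_sAbc = make_embedding(0, space=True,  cap_first=True,  cap_all=False)   # " Abc"
emb_ABC  = make_embedding(0, space=False, cap_first=False, cap_all=True)    # "ABC"

# All variations of one base token share the last D_SHARED channels:
assert torch.equal(emb_abc[D_FLAGS:], emb_sAbc[D_FLAGS:])

# With a tied output head, the logits of all variations differ only in the
# contribution of the 3 flag channels, so they are cheap to compute together:
h = torch.randn(D_FLAGS + D_SHARED)           # some final hidden state
variants = torch.stack([emb_abc, emb_sabc, emb_sAbc, emb_ABC])
logits = variants @ h                         # the shared part could be computed once and reused
```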
## Quick start
Use https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4neo (latest code, compatible with v4).
A good prompt format for Q&A: `prompt = f'\nQ & A\n\nQuestion:\n{qq}\n\nDetailed Expert Answer:\n'` (this lets the model generate the answer).
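A quick sketch of using that format; `generate` is only a placeholder for whatever RWKV inference call you use (e.g. the RWKV-v4neo scripts or the rwkvstic package), not a real API:

```python
def build_qa_prompt(qq: str) -> str:
    # The "Q & A" framing nudges the model to answer like an expert
    # instead of continuing or rephrasing the question.
    return f'\nQ & A\n\nQuestion:\n{qq}\n\nDetailed Expert Answer:\n'

prompt = build_qa_prompt("Why is the sky blue?")
# answer = generate(prompt)   # plug in your RWKV inference function here
```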
## How it works
RWKV is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103).
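For reference, the core AFT operation (per channel, as given in the paper linked above) is

$$Y_t = \sigma(Q_t) \odot \frac{\sum_{i=1}^{T} \exp(K_i + w_{t,i}) \odot V_i}{\sum_{i=1}^{T} \exp(K_i + w_{t,i})}$$

and, roughly speaking, RWKV replaces the full learned pairwise bias $w_{t,i}$ with a distance-based decay $w_{t,i} = -(t-i)\,w$, where $w$ is a trainable non-negative per-channel vector. That is what turns the sum into a recurrence that can run as an RNN.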
I believe RWKV is performant because W is like repeatedly applying a diagonal matrix, i.e. each channel simply decays at its own rate.
Moreover it's possible to turn it into a continuous ODE (a bit similar to State Space Models). I will write about it later.
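Here is a simplified sketch of this kind of per-channel decay recurrence (what "repeatedly applying a diagonal matrix" amounts to). It omits RWKV's separate "bonus" term for the current token and the exponent rescaling used for numerical stability in the real kernels, and the function name is illustrative:

```python
import torch

def wkv_recurrence(w: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Decay-weighted average of past values, run as an RNN.

    w: (C,) non-negative per-channel decay rates
    k: (T, C) keys, v: (T, C) values
    Returns (T, C): at each step, a softmax-like average of the values seen so far,
    where older entries are down-weighted by exp(-w) per elapsed step.
    """
    T, C = k.shape
    num = torch.zeros(C)              # running sum of exp(k_i) * v_i, decayed each step
    den = torch.zeros(C)              # running sum of exp(k_i), decayed each step
    out = torch.zeros(T, C)
    decay = torch.exp(-w)             # the "diagonal matrix" applied once per step = element-wise decay
    for t in range(T):
        num = decay * num + torch.exp(k[t]) * v[t]
        den = decay * den + torch.exp(k[t])
        out[t] = num / den
    return out

out = wkv_recurrence(torch.rand(4), torch.randn(8, 4), torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 4])
```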
## Star History
[Star History Chart](https://star-history.com/#BlinkDL/RWKV-LM&Date)
## Multimodal ideas
I have an idea for [text --> 32x32 RGB image] generation using an LM (transformer, RWKV, etc.), and will test it soon.