Compare commits
No commits in common. 'main' and '4.00' have entirely different histories.
[Binary image diffs: five images deleted (161 KiB, 67 KiB, 69 KiB, 90 KiB, 66 KiB); two images changed (410 KiB → 1.1 MiB, 359 KiB → 153 KiB).]
@@ -1,132 +0,0 @@
#include <stdio.h>
#include <assert.h>
#include "ATen/ATen.h"
#define MIN_VALUE (-1e38)
typedef at::BFloat16 bf16;

__global__ void kernel_forward(const int B, const int T, const int C,
                               const float *__restrict__ const _w, const bf16 *__restrict__ const _u, const bf16 *__restrict__ const _k, const bf16 *__restrict__ const _v,
                               bf16 *__restrict__ const _y) {
    // One thread per (batch, channel) pair; each thread scans its full sequence of length T.
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int _b = idx / C;
    const int _c = idx % C;
    const int _offset = _b * T * C + _c;

    float u = float(_u[_c]);
    float w = _w[_c];
    const bf16 *__restrict__ const k = _k + _offset;
    const bf16 *__restrict__ const v = _v + _offset;
    bf16 *__restrict__ const y = _y + _offset;

    // aa and bb are running sums divided by exp(pp) (to avoid overflow)
    float aa = 0, bb = 0, pp = MIN_VALUE;
    for (int i = 0; i < T; i++) {
        const int ii = i * C;
        const float kk = float(k[ii]);
        const float vv = float(v[ii]);

        float ww = u + kk;
        float p = max(pp, ww);
        float e1 = exp(pp - p);
        float e2 = exp(ww - p);
        y[ii] = bf16((e1 * aa + e2 * vv) / (e1 * bb + e2));

        ww = w + pp;
        p = max(ww, kk);
        e1 = exp(ww - p);
        e2 = exp(kk - p);
        aa = e1 * aa + e2 * vv;
        bb = e1 * bb + e2;
        pp = p;
    }
}

__global__ void kernel_backward(const int B, const int T, const int C,
                                const float *__restrict__ const _w, const bf16 *__restrict__ const _u, const bf16 *__restrict__ const _k, const bf16 *__restrict__ const _v,
                                const bf16 *__restrict__ const _y, const bf16 *__restrict__ const _gy,
                                bf16 *__restrict__ const _gw, bf16 *__restrict__ const _gu, bf16 *__restrict__ const _gk, bf16 *__restrict__ const _gv) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int _b = idx / C;
    const int _c = idx % C;
    const int _offset = _b * T * C + _c;

    float u = float(_u[_c]);
    float w = _w[_c];
    const bf16 *__restrict__ const k = _k + _offset;
    const bf16 *__restrict__ const v = _v + _offset;
    const bf16 *__restrict__ const y = _y + _offset;
    const bf16 *__restrict__ const gy = _gy + _offset;
    bf16 *__restrict__ const gk = _gk + _offset;
    bf16 *__restrict__ const gv = _gv + _offset;

    // Tmax is a compile-time macro (the sequence-length cap) that sizes these per-thread buffers.
    float q[Tmax], r[Tmax];

    // Forward sweep: replay the recurrence, accumulating gw and gu and caching q[]/r[] for the reverse sweep.
    float gw = 0, gu = 0, aa = 0, bb = 0, ga = 0, gb = 0, pp = MIN_VALUE;
    for (int i = 0; i < T; i++) {
        const int ii = i * C;
        const float kk = float(k[ii]);
        const float vv = float(v[ii]);
        const float yy = float(y[ii]);

        float ww = u + kk;
        float p = max(pp, ww);
        float e1 = exp(pp - p);
        float e2 = exp(ww - p);
        const float qq = float(gy[ii]) / (e1 * bb + e2);
        gw += (ga - gb * yy) * e1 * qq;
        gu += (vv - yy) * e2 * qq;
        q[i] = qq;
        r[i] = ww - p;

        ww = w + pp;
        p = max(ww, kk);
        e1 = exp(ww - p);
        e2 = exp(kk - p);
        ga = e1 * (aa + ga);
        gb = e1 * (bb + gb);
        aa = e1 * aa + e2 * vv;
        bb = e1 * bb + e2;
        pp = p;
    }
    const int _offsetBC = _b * C + _c;
    _gw[_offsetBC] = bf16(gw * _w[_c]); // multiply by w because of w -> -exp(w) in python forward()
    _gu[_offsetBC] = bf16(gu);

    // Reverse sweep: accumulate the gradients of k and v contributed by later timesteps.
    aa = 0, bb = 0, pp = MIN_VALUE;
    for (int i = T - 1; i >= 0; i--) {
        const int ii = i * C;
        const float kk = float(k[ii]);
        const float vv = float(v[ii]);
        const float yy = float(y[ii]);
        const float qq = q[i];
        const float rr = r[i];

        float e1 = qq * exp(rr);
        float e2 = exp(kk + pp);
        gk[ii] = bf16(e1 * (vv - yy) + e2 * (aa * vv + bb));
        gv[ii] = bf16(e1 + e2 * aa);

        const float ww = w + pp;
        const float www = rr - u - kk;
        const float p = max(ww, www);
        e1 = exp(ww - p);
        e2 = qq * exp(www - p);
        aa = e1 * aa + e2;
        bb = e1 * bb - e2 * yy;
        pp = p;
    }
}

void cuda_forward(int B, int T, int C, float *w, bf16 *u, bf16 *k, bf16 *v, bf16 *y) {
    dim3 threadsPerBlock( min(C, 32) ); // requires --maxrregcount 60 for optimal performance
    assert(B * C % threadsPerBlock.x == 0);
    dim3 numBlocks(B * C / threadsPerBlock.x);
    kernel_forward<<<numBlocks, threadsPerBlock>>>(B, T, C, w, u, k, v, y);
}

void cuda_backward(int B, int T, int C, float *w, bf16 *u, bf16 *k, bf16 *v, bf16 *y, bf16 *gy, bf16 *gw, bf16 *gu, bf16 *gk, bf16 *gv) {
    dim3 threadsPerBlock( min(C, 32) ); // requires --maxrregcount 60 for optimal performance
    assert(B * C % threadsPerBlock.x == 0);
    dim3 numBlocks(B * C / threadsPerBlock.x);
    kernel_backward<<<numBlocks, threadsPerBlock>>>(B, T, C, w, u, k, v, y, gy, gw, gu, gk, gv);
}
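The "running sums divided by exp(pp)" comment is the key numerical trick: the kernel tracks the numerator (aa) and denominator (bb) of the WKV weighted average under a shared exponent pp, re-shifting with max() at every step so no exp() overflows in float32. A minimal NumPy sketch of the same scan for one (batch, channel) slice, for reference only (the function name and the use of NumPy/float64 are mine; the arithmetic mirrors kernel_forward line by line):

import numpy as np

def wkv_forward_1d(w, u, k, v):
    """Stabilized WKV scan for one (batch, channel) slice.
    w: per-channel decay, already mapped through w -> -exp(w) upstream (so w < 0);
    u: per-channel bonus for the current token; k, v: length-T arrays."""
    T = len(k)
    y = np.empty(T)
    aa, bb, pp = 0.0, 0.0, -1e38     # running numerator, denominator, shared exponent
    for i in range(T):
        ww = u + k[i]
        p = max(pp, ww)              # shift both terms by the larger exponent
        e1, e2 = np.exp(pp - p), np.exp(ww - p)
        y[i] = (e1 * aa + e2 * v[i]) / (e1 * bb + e2)
        ww = w + pp                  # decay the running state by one step
        p = max(ww, k[i])
        e1, e2 = np.exp(ww - p), np.exp(k[i] - p)
        aa = e1 * aa + e2 * v[i]
        bb = e1 * bb + e2
        pp = p
    return y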
@@ -1,25 +0,0 @@
#include <torch/extension.h>
#include "ATen/ATen.h"
typedef at::BFloat16 bf16;

void cuda_forward(int B, int T, int C, float *w, bf16 *u, bf16 *k, bf16 *v, bf16 *y);
void cuda_backward(int B, int T, int C, float *w, bf16 *u, bf16 *k, bf16 *v, bf16 *y, bf16 *gy, bf16 *gw, bf16 *gu, bf16 *gk, bf16 *gv);

void forward(int64_t B, int64_t T, int64_t C, torch::Tensor &w, torch::Tensor &u, torch::Tensor &k, torch::Tensor &v, torch::Tensor &y) {
    cuda_forward(B, T, C, w.data_ptr<float>(), u.data_ptr<bf16>(), k.data_ptr<bf16>(), v.data_ptr<bf16>(), y.data_ptr<bf16>());
}
void backward(int64_t B, int64_t T, int64_t C, torch::Tensor &w, torch::Tensor &u, torch::Tensor &k, torch::Tensor &v, torch::Tensor &y,
              torch::Tensor &gy, torch::Tensor &gw, torch::Tensor &gu, torch::Tensor &gk, torch::Tensor &gv) {
    cuda_backward(B, T, C, w.data_ptr<float>(), u.data_ptr<bf16>(), k.data_ptr<bf16>(), v.data_ptr<bf16>(), y.data_ptr<bf16>(),
                  gy.data_ptr<bf16>(), gw.data_ptr<bf16>(), gu.data_ptr<bf16>(), gk.data_ptr<bf16>(), gv.data_ptr<bf16>());
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &forward, "wkv forward");
    m.def("backward", &backward, "wkv backward");
}

TORCH_LIBRARY(wkv, m) {
    m.def("forward", forward);
    m.def("backward", backward);
}
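These bindings are what makes the kernels callable from Python. A sketch of how such an extension is typically JIT-compiled and invoked via torch.utils.cpp_extension.load; the file paths, T_MAX value, and flag list are illustrative (only --maxrregcount 60 and a compile-time Tmax macro are implied by the sources above), and the w -> -exp(w) convention follows the comment in kernel_backward:

import torch
from torch.utils.cpp_extension import load

T_MAX = 1024  # must match the Tmax macro that sizes q[]/r[] in kernel_backward
wkv_cuda = load(
    name="wkv",
    sources=["cuda/wkv_op_bf16.cpp", "cuda/wkv_cuda_bf16.cu"],  # illustrative paths
    verbose=True,
    extra_cuda_cflags=["-O3", "--maxrregcount 60", f"-DTmax={T_MAX}"],
)

B, T, C = 2, 16, 64
assert T <= T_MAX and (B * C) % min(C, 32) == 0     # mirrors the assert in cuda_forward
w = -torch.exp(torch.rand(C, device="cuda"))        # Python side passes w -> -exp(w)
u = torch.rand(C, device="cuda", dtype=torch.bfloat16)
k = torch.rand(B, T, C, device="cuda", dtype=torch.bfloat16)
v = torch.rand(B, T, C, device="cuda", dtype=torch.bfloat16)
y = torch.empty((B, T, C), device="cuda", dtype=torch.bfloat16)
wkv_cuda.forward(B, T, C, w, u, k, v, y)            # result is written into y in place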
@@ -1,104 +0,0 @@
########################################################################################################
# The RWKV Language Model - https://github.com/BlinkDL/RWKV-LM
########################################################################################################

# This verifies the results of different model implementations and makes sure they agree with each other.

import os, sys, types
import numpy as np
import torch
np.set_printoptions(precision=4, suppress=True, linewidth=200)
try:
    os.environ["CUDA_VISIBLE_DEVICES"] = sys.argv[1]
except:
    pass
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False

os.environ['RWKV_FLOAT_MODE'] = 'bf16' # bf16 or fp32
os.environ['RWKV_RUN_DEVICE'] = 'cuda' # currently model_train requires CUDA
RUN_DEVICE = os.environ['RWKV_RUN_DEVICE']

TOKEN_MODE = 'pile'

if TOKEN_MODE == 'pile':
    WORD_NAME = ['20B_tokenizer.json', '20B_tokenizer.json']
    MODEL_NAME = '/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-3b/RWKV-4-Pile-3B-20221003-6783'
    n_layer = 32
    n_embd = 2560
    ctx_len = 1024
    UNKNOWN_CHAR = None

from src.utils import TOKENIZER
tokenizer = TOKENIZER(WORD_NAME, UNKNOWN_CHAR=UNKNOWN_CHAR)
if TOKEN_MODE == 'pile':
    tokenizer.vocab_size = 50277

########################################################################################################

os.environ["RWKV_JIT_ON"] = "1"
os.environ["RWKV_T_MAX"] = str(ctx_len)

from src.model_run import RWKV_RNN
from src.model import RWKV

args = types.SimpleNamespace()
args.vocab_size = tokenizer.vocab_size
args.ctx_len = ctx_len
args.n_embd = n_embd
args.n_layer = n_layer
args.head_qk = 0
args.pre_ffn = 0
args.grad_cp = 0
args.my_pos_emb = 0
model_train = RWKV(args).to(RUN_DEVICE)

if os.environ['RWKV_FLOAT_MODE'] == 'fp16':
    model_train = model_train.half()
elif os.environ['RWKV_FLOAT_MODE'] == 'bf16':
    model_train = model_train.bfloat16()

print('loading ' + MODEL_NAME)
m2 = torch.load(MODEL_NAME + '.pth', map_location='cpu')
model_train.load_state_dict(m2)

if os.environ['RWKV_FLOAT_MODE'] == 'fp16':
    model_train = model_train.half()
elif os.environ['RWKV_FLOAT_MODE'] == 'bf16':
    model_train = model_train.bfloat16()

args.MODEL_NAME = MODEL_NAME
args.RUN_DEVICE = RUN_DEVICE
args.FLOAT_MODE = os.environ['RWKV_FLOAT_MODE']
model_rnn = RWKV_RNN(args)

########################################################################################################

print(f"\nVerifying {os.environ['RWKV_RUN_DEVICE']} {os.environ['RWKV_FLOAT_MODE']}")

# context = '\nIn a'
context = '\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.'

if TOKEN_MODE == 'pile':
    ctx = tokenizer.tokenizer.encode(context)
print(f'input len {len(ctx)} data {ctx}')

########################################################################################################

with torch.no_grad():
    print('\nRWKV-train output')
    out = model_train.forward(torch.tensor([ctx]).to(RUN_DEVICE))[0].detach().cpu().float().numpy()
    print(out, '\n')

    print('\nRWKV-RNN output')
    state = None
    out = None
    src_len = len(ctx)
    for i in range(src_len):
        x = ctx[:i+1]
        out, state = model_rnn.forward(x, state)
        if i < 3 or i >= src_len - 3:
            print(out.detach().cpu().numpy())
        if i == 2:
            print('...')
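As written, the script relies on eyeballing the printed logits. If an automated check is wanted, a few lines like these could be appended inside the torch.no_grad() block (not in the original script; the shapes assume model_train returns [B, T, vocab] logits and the RNN returns last-token logits, and the 1e-2 tolerance is a guess since bf16 through two code paths will not match bit-for-bit):

    # Not in the original script: compare the last-token logits numerically.
    out_train = model_train.forward(torch.tensor([ctx]).to(RUN_DEVICE))[0][-1].detach().cpu().float().numpy()
    out_rnn = out.detach().cpu().float().numpy()  # 'out' holds the final RNN step from the loop above
    print('max abs diff:', np.max(np.abs(out_train - out_rnn)))
    assert np.allclose(out_train, out_rnn, atol=1e-2), 'GPT-mode and RNN-mode outputs disagree'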