From 83a4512b74b4b16d549040a36590f6cc8deb1d5a Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Fri, 13 Jan 2023 20:03:04 +0800
Subject: [PATCH] Update README.md

---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d8c6708..33c0c9e 100644
--- a/README.md
+++ b/README.md
@@ -104,12 +104,17 @@ Colab for fine-tuning RWKV-4 Pile models: https://colab.research.google.com/gith
 ```
 python tools/preprocess_data.py --input ./my_data.jsonl --output-prefix ./data/my_data --vocab ./20B_tokenizer.json --dataset-impl mmap --tokenizer-type HFTokenizer --append-eod
 ```
-The jsonl format sample:
+Sample jsonl input (one document per line):
 ```
 {"meta": {"ID": 101}, "text": "This is the first document."}
 {"meta": {"ID": 102}, "text": "Hello\nWorld"}
 {"meta": {"ID": 103}, "text": "1+1=2\n1+2=3\n2+2=4"}
 ```
+Such lines can be generated with code like this:
+```
+ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
+out.write(ss + "\n")
+```
 
 ## How it works