Update README.md

main
PENG Bo 3 years ago committed by GitHub
parent 59e6deeb58
commit 83a4512b74
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -104,12 +104,17 @@ Colab for fine-tuning RWKV-4 Pile models: https://colab.research.google.com/gith
```
python tools/preprocess_data.py --input ./my_data.jsonl --output-prefix ./data/my_data --vocab ./20B_tokenizer.json --dataset-impl mmap --tokenizer-type HFTokenizer --append-eod
```
The jsonl format sample (one line for each document):
```
{"meta": {"ID": 101}, "text": "This is the first document."}
{"meta": {"ID": 102}, "text": "Hello\nWorld"}
{"meta": {"ID": 103}, "text": "1+1=2\n1+2=3\n2+2=4"}
```
generated by code like this:
```
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")
```
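For reference, a self-contained version of the snippet above might look like the following. This is only a sketch: the `docs` list and the `write_jsonl` helper are hypothetical names introduced for illustration, and the output path `my_data.jsonl` is assumed to match the `--input` argument shown earlier.

```python
import json

# Hypothetical sample documents; replace with your own (meta, text) pairs.
docs = [
    ({"ID": 101}, "This is the first document."),
    ({"ID": 102}, "Hello\nWorld"),
    ({"ID": 103}, "1+1=2\n1+2=3\n2+2=4"),
]

def write_jsonl(path, docs):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as out:
        for meta, text in docs:
            # json.dumps escapes newlines inside "text" as \n,
            # so every document stays on a single line.
            ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
            out.write(ss + "\n")

write_jsonl("my_data.jsonl", docs)
```

`ensure_ascii=False` keeps non-ASCII characters as-is instead of turning them into `\uXXXX` escapes, which is why the file is opened with an explicit UTF-8 encoding here.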
## How it works
