From 7a47bcd09663fe56a7dc4fe799543717afeb5f15 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Tue, 17 May 2022 02:24:24 +0800
Subject: [PATCH] Update README.md

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index 3acbd24..aa1633b 100644
--- a/README.md
+++ b/README.md
@@ -104,6 +104,16 @@ In the last three plots, black = predicted loss curve of the new LR schedule, bl
 
 ![better_lr_schedule](Research/better_lr_schedule.png)
 
+## How to sample a large dataset
+
+I use a trick to sample the Pile deterministically, yet in a random-enough order.
+
+Let's say the Pile has x chunks (a chunk = ctx_len tokens).
+
+Pick a prime number p just below x such that p = 2 (mod 3).
+
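+Use (step * step * step) mod p as the chunk index at each training step. Since p = 2 (mod 3), 3 does not divide p - 1, so cubing is a bijection mod p: steps 0 to p - 1 visit each of the chunks 0 to p - 1 exactly once, in a scrambled but reproducible order.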
+
 ## The top-p-x sampling method
 
 We propose a new sampling method called top-p-x: