From 7a47bcd09663fe56a7dc4fe799543717afeb5f15 Mon Sep 17 00:00:00 2001
From: PENG Bo <33809201+BlinkDL@users.noreply.github.com>
Date: Tue, 17 May 2022 02:24:24 +0800
Subject: [PATCH] Update README.md

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index 3acbd24..aa1633b 100644
--- a/README.md
+++ b/README.md
@@ -104,6 +104,16 @@ In the last three plots, black = predicted loss curve of the new LR schedule, bl
 
 ![better_lr_schedule](Research/better_lr_schedule.png)
 
+## How to sample a large dataset
+
+I am using a trick to sample the Pile deterministically yet randomly enough.
+
+Let's say the Pile has x chunks (a chunk = ctx_len tokens).
+
+Pick a prime number p just less than x, and make sure p = 2 (mod 3), so that step -> step^3 mod p is a one-to-one map on 0..p-1.
+
+Use (step * step * step) mod p as the index of the chunk to sample at each step.
+
 ## The top-p-x sampling method
 
 We propose a new sampling method called top-p-x:
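
A minimal sketch of the sampling trick added in this patch, not the repository's actual code. The helper names `is_prime`, `pick_prime`, and `chunk_index` are hypothetical, and "just less than x" is assumed to mean the largest prime below x with p = 2 (mod 3). Because 3 does not divide p - 1 for such a prime, cubing permutes the residues 0..p-1, so p consecutive steps visit p distinct chunks.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; enough for a one-off setup step."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def pick_prime(num_chunks: int) -> int:
    """Largest prime p < num_chunks with p % 3 == 2 (assumed reading of 'just less than x')."""
    for p in range(num_chunks - 1, 1, -1):
        if p % 3 == 2 and is_prime(p):
            return p
    raise ValueError("no suitable prime below num_chunks")


def chunk_index(step: int, p: int) -> int:
    """Chunk to read at a given training step: (step * step * step) mod p."""
    return pow(step, 3, p)


p = pick_prime(12)                               # 11
order = [chunk_index(s, p) for s in range(p)]    # [0, 1, 8, 5, 9, 4, 7, 2, 6, 3, 10]
```

The chunks with index p..x-1 are never sampled under this scheme, which is why p should sit as close below x as possible.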