diff --git a/README.md b/README.md
index 8e5bfd5..ed556aa 100644
--- a/README.md
+++ b/README.md
@@ -38,6 +38,12 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
 
 The token-shift means explicitly using both (half channel of this token) & (half channel of prev token) to generate all vectors.
 
+```
+self.time_shift = nn.ZeroPad2d((0,0,1,-1))
+
+x = torch.cat([self.time_shift(x[:, :, :C//2]), x[:, :, C//2:]], dim = -1)
+```
+
 I found dividing channels by 2 and shift-1 works the best for Chinese LM. You may want to use more shift for English char-level LM.
 
 I checked the weights and found you may want to use less mixing in higher layers.
 
 My theory on the effectiveness of token-shift:
@@ -64,6 +70,18 @@ We also propose a new sampling method called top-a (as in src/utils.py):
 
 (3) Feel free to tune the 0.02 and 2 factor.
 
+The idea of top-a:
+1. If max_prob=0.9, then remove all tokens with prob < 0.0162 (so, removing most alternatives)
+2. If max_prob=0.5, then remove all tokens with prob < 0.0050 (so, allowing more choices)
+3. If max_prob=0.1, then remove all tokens with prob < 0.0002 (so, allowing lots of possibilities)
+
+```
+probs = F.softmax(logits, dim=-1)
+
+limit = torch.pow(torch.max(probs), 2.0) * 0.02
+logits[probs < limit] = -float('Inf')
+```
+
 # Performance
 
 Character-level loss on simplebooks-92 dataset https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
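
For reviewers who want to see what the `nn.ZeroPad2d((0,0,1,-1))` trick in the first hunk actually does, here is a minimal sketch, not part of the diff; the toy shapes `B, T, C` are made up for illustration. It checks that the first half of the channels at position t ends up holding the features of position t-1, with zeros at position 0.

```
# Minimal sketch (not part of the diff): verify that nn.ZeroPad2d((0,0,1,-1))
# shifts the first half of the channels back by one token along the sequence axis.
import torch
import torch.nn as nn

B, T, C = 1, 4, 6                                # toy batch, sequence length, channels
x = torch.arange(B * T * C, dtype=torch.float32).reshape(B, T, C)

time_shift = nn.ZeroPad2d((0, 0, 1, -1))         # pad one zero row at the top of T, crop one at the bottom
shifted_half = time_shift(x[:, :, :C // 2])      # first half of channels, taken from the previous token
mixed = torch.cat([shifted_half, x[:, :, C // 2:]], dim=-1)

assert torch.equal(mixed[:, 1:, :C // 2], x[:, :-1, :C // 2])     # position t sees token t-1
assert torch.equal(mixed[:, 0, :C // 2], torch.zeros(B, C // 2))  # position 0 sees zeros
```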
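
And a self-contained sketch of the top-a snippet from the second hunk, wrapped into a function so the thresholds quoted in the bullets (0.0162, 0.0050, 0.0002) can be reproduced. The name `sample_top_a` and the final multinomial draw are illustrative assumptions, not necessarily how src/utils.py implements it; only the masking rule is taken from the diff.

```
# Minimal sketch (assumptions: the name sample_top_a and the multinomial draw are
# illustrative; only the masking rule comes from the snippet in the diff).
import torch
import torch.nn.functional as F

def sample_top_a(logits, ratio=0.02, power=2.0):
    probs = F.softmax(logits, dim=-1)
    limit = torch.pow(torch.max(probs), power) * ratio   # limit = 0.02 * max_prob^2
    logits = logits.clone()
    logits[probs < limit] = -float('Inf')                # drop tokens below the limit
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)

# Reproduce the thresholds in the bullet list: 0.9 -> 0.0162, 0.5 -> 0.0050, 0.1 -> 0.0002
for max_prob in (0.9, 0.5, 0.1):
    print(max_prob, 0.02 * max_prob ** 2)
```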