Update README.md

@@ -38,6 +38,12 @@ Moreover we multiply the final output of Time-mix layer by γ(t). The reason for
Token-shift means explicitly using both (half of the channels of this token) & (half of the channels of the previous token) to generate all vectors.
```
self.time_shift = nn.ZeroPad2d((0,0,1,-1))  # pad one zero step at the start of time, drop the last step
x = torch.cat([self.time_shift(x[:, :, :C//2]), x[:, :, C//2:]], dim = -1)  # prev-token half + current-token half
```
I found that dividing the channels by 2 with a shift of 1 works best for a Chinese LM. You may want to use more shift for an English char-level LM. Inspecting the trained weights suggests using less mixing in higher layers.
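
For concreteness, here is a minimal runnable sketch of the same shift on a dummy tensor (the shapes B=2, T=5, C=8 are made up for illustration; this is not code from the repo):
```
import torch
import torch.nn as nn

B, T, C = 2, 5, 8
x = torch.randn(B, T, C)

# Pad one zero step at the start of the time axis and drop the last step,
# so position t of the shifted half holds the values from position t-1.
time_shift = nn.ZeroPad2d((0, 0, 1, -1))
mixed = torch.cat([time_shift(x[:, :, :C//2]), x[:, :, C//2:]], dim=-1)

assert mixed.shape == (B, T, C)
assert torch.equal(mixed[:, 0, :C//2], torch.zeros(B, C//2))   # t=0 sees zeros in the shifted half
assert torch.equal(mixed[:, 1:, :C//2], x[:, :-1, :C//2])      # t>0 sees token t-1 in the shifted half
```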
My theory on the effectiveness of token-shift:
@@ -64,6 +70,18 @@ We also propose a new sampling method called top-a (as in src/utils.py):
(3) Feel free to tune the 0.02 and 2 factors.
The idea of top-a:
1. If max_prob=0.9, then remove all tokens with prob < 0.0162 (so, removing most alternatives)
2. If max_prob=0.5, then remove all tokens with prob < 0.0050 (so, allowing more choices)
3. If max_prob=0.1, then remove all tokens with prob < 0.0002 (so, allowing lots of possibilities)
```
probs = F.softmax(logits, dim=-1)
limit = torch.pow(torch.max(probs), 2.0) * 0.02  # threshold = 0.02 * max_prob^2
logits[probs < limit] = -float('Inf')            # mask out tokens below the threshold
```
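
As a usage sketch, the filter drops into a sampling step like this (the function name and arguments below are illustrative, not taken from src/utils.py):
```
import torch
import torch.nn.functional as F

def sample_top_a(logits, power=2.0, ratio=0.02):
    # logits: 1-D tensor of unnormalized scores for the next token
    probs = F.softmax(logits, dim=-1)
    limit = torch.pow(torch.max(probs), power) * ratio  # threshold scales with max_prob^power
    logits = logits.clone()
    logits[probs < limit] = -float('Inf')                # remove unlikely tokens
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).item()

# made-up logits over a 5-token vocabulary
next_token = sample_top_a(torch.tensor([2.0, 1.5, 0.3, -1.0, -3.0]))
```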
# Performance
Character-level loss on the simplebooks-92 dataset (https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip).
