Mikko Juola
44e0abf0f1
Clarify that the OpenCL implementations all use f16.
3 years ago
Mikko Juola
58463458ee
Put benchmarks on top of README.md.
3 years ago
Mikko Juola
882ff05254
Update README.md for new benchmarks.
3 years ago
Mikko Juola
3d0afcf243
Make matrix multiplication multithreaded.
...
This improves performance greatly with f16. It's faster now than OpenCL
on LLaMA-7B.
3 years ago
Mikko Juola
8134c20d57
We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way, but right now it looks like the most memory-friendly.
3 years ago
Mikko Juola
1f5e687298
Modest improvement to f16 matrix_vector_mul_transposed without OpenCL.
...
It's still significantly slower than the f32 version.
3 years ago
Mikko Juola
acfd6bd5bd
Add f16, non-OpenCL version of matrix_vector_mul_transposed as well.
...
This seems to be 100% slower than the pure f32 version in the benchmark. Not
sure why as of this commit, but I'll investigate further.
3 years ago
Mikko Juola
baecd25ee3
Add f16 version of matrix multiplication that works without any OpenCL.
...
In the benchmark it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet though, and I need to implement some
other ops for that to start working.
3 years ago
Mikko Juola
a1970b8a9c
Improve transposed matrix multiplication further; this gives a further ~10%-20% performance increase by improving the memory-load-to-instruction ratio.
3 years ago
Mikko Juola
61bc42b728
Improve the handwritten AVX2 for matrix_mul_inplace_transposed.
...
This is roughly 60% faster than the old version.
3 years ago
Mikko Juola
0cce655763
Unroll the handwritten AVX2 matrix_vector_mul_transposed slightly; this gives a ~20% boost to that operation.
...
Modest improvement in overall performance for text generation.
3 years ago
Mikko Juola
09f76dfcfa
Update README.md opening with new benchmark numbers.
3 years ago
Mikko Juola
4b8accee44
Update benchmarks.
3 years ago
Mikko Juola
de5dd59277
Some code cleanup in OpenCL.
3 years ago
Mikko Juola
8aef5d8831
Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make the use of the _inplace suffix consistent.
3 years ago
Mikko Juola
1c5ec04217
Add a different kernel to be used when OpenCL device is a CPU.
...
This is almost the same code I had before. It runs better on CPUs than
on GPUs.
3 years ago
Mikko Juola
8c64313fec
Rewrite the matrix multiplication.
...
This is roughly 10 times faster than the old one. Surprisingly, it
didn't have much impact on text generation time. Maybe most of the
remaining slowness no longer comes from matrix multiplication.
It also slowed down the CPU implementation. I think I'll try adding
another kernel later for CPU OpenCL.
3 years ago
Mikko Juola
862d4a15d6
Add repetition penalty, add colors to outputs based on probabilities, try to make softmax() more numerically stable.
3 years ago
Mikko Juola
f4629ca987
Respect the stop token from the model.
3 years ago
Mikko Juola
de477314ed
Fix newlines not recognized when feeding newlines in the prompt.
...
The tokenizer would misinterpret the newlines. In general, the non-printable
control characters don't seem to be tokenized correctly at the moment. I
added a band-aid for newlines but should maybe fix the others too.
3 years ago
Mikko Juola
687bbf1249
Add instructions on how to use OpenCL in the README.md
3 years ago
Mikko Juola
8de18bdc77
Add screenshot to README.md.
3 years ago
Mikko Juola
a2e88c1193
Update README.md
3 years ago
Mikko Juola
b4d5cf91a7
Mention in README.md that using OpenCL does not cast weights to 32-bit floats.
3 years ago
Mikko Juola
99da6ed71a
Update README.md benchmarks for new attention OpenCL thing.
3 years ago
Mikko Juola
35b0c372a8
Implement some attention operations for OpenCL.
3 years ago
Mikko Juola
6e456e64f3
Add new benchmarks now that this is partially OpenCLified.
3 years ago
Mikko Juola
63d27dba90
Add partial OpenCL support, it's used in feed forward network only.
3 years ago
Mikko Juola
df079bceb0
Add records of my benchmarks to README.md so I can compare it later.
3 years ago
Mikko Juola
c9c861d199
Add some measurements so we can get tokens per second.
3 years ago
Mikko Juola
22792b26cc
Add an idea about on-disk cache for initial prompt processing (not for weights).
3 years ago
Mikko Juola
9087c50efa
Add notes about improving sampler to README.md
3 years ago
Mikko Juola
1a88482988
Add some OpenCL bits.
...
I wrote an OpenCL matrix_mul_inplace_transposed. Run on a GPU, it is much
faster than my CPU implementation, and run on a CPU it is still quite a
lot faster than my own implementation (OpenCL runs on both CPU and GPU).
Basically it can destroy all of my crappy code. So I think I will be
replacing some of my other operations with this stuff in the near future.
3 years ago
Mikko Juola
a92017bf56
Add some initial OpenCL stuff.
...
I can copy tensors to GPU and back but not much more. Maybe next time
I'll try implementing matrix_mul_transposed or something on the GPU.
3 years ago
Mikko Juola
53d367e6fa
Add some beginnings of OpenCL implementation.
...
I think I'll try to get the smaller modules to run faster.
3 years ago
Mikko Juola
846759b277
Optimize conversions to and from f16<->f32.
...
x86 cannot do f16 operations natively, but it does have an instruction
to convert them to f32. I optimized those to use SIMD instructions.
3 years ago
Mikko Juola
8acb9f32b8
Update README.md for new discoveries.
3 years ago
Mikko Juola
26d5309cf7
Add support for bigger models.
...
I've tested with the 13B LLaMA model and it seems to work.
There was a bug in the unpickler that skipped over tuples of size 1. I had
written a bunch of code assuming there was no bug; I fixed the bug and
removed some unpickling code.
I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
Mikko Juola
8a427bcb21
The project is actually called rllama, put that in readme.md.
3 years ago
Mikko Juola
18ef805458
Read parameters from model's JSON file instead of hard-coding them, make max sequence length configurable.
3 years ago
Mikko Juola
f103871bc0
Make the output colored. This is essential to be taken seriously.
...
Also did some clippy happiness changes.
3 years ago
Mikko Juola
cd28aba5e2
Make the output look nicer.
3 years ago
Mikko Juola
d7a3f57510
Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file.
3 years ago
Mikko Juola
8bb9404168
Update README to clarify this is a Rust project and to show how to change temperature, top_k, top_p stuff.
3 years ago
Mikko Juola
f6217e0036
Add readme, make clippy happy.
3 years ago
Mikko Juola
3b8f904f13
First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.
3 years ago