27 Commits (de477314edf6ff863f007fc536fe6e98699570a1)
 

Author SHA1 Message Date
Mikko Juola de477314ed Fix newlines not being recognized when fed in the prompt.
The tokenizer would misinterpret them. In general, non-printable
control characters don't seem to be tokenized correctly at the moment. I
added a band-aid for newlines but should maybe fix the others too.
3 years ago
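A newline band-aid like the one described could look something like the sketch below: tokenize each line separately and splice a fixed newline token id between them. This is an illustrative guess, not the project's actual code; `tokenize_line` is a stand-in for the real tokenizer, and the id 13 (which some LLaMA tokenizers use for "\n") is an assumption.

```rust
fn tokenize_line(line: &str) -> Vec<u32> {
    // Stand-in for the project's real tokenizer: one fake token per word,
    // encoded here as the word's length just so the sketch is runnable.
    line.split_whitespace().map(|w| w.len() as u32).collect()
}

// Assumed newline token id; not taken from the project.
const NEWLINE_TOKEN: u32 = 13;

fn tokenize_with_newlines(prompt: &str) -> Vec<u32> {
    let mut out = Vec::new();
    for (i, line) in prompt.split('\n').enumerate() {
        if i > 0 {
            // Splice the known newline token between per-line tokenizations.
            out.push(NEWLINE_TOKEN);
        }
        out.extend(tokenize_line(line));
    }
    out
}
```

The real fix would be to teach the tokenizer about control characters directly instead of special-casing '\n'.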
Mikko Juola 687bbf1249 Add instructions on how to use OpenCL in the README.md 3 years ago
Mikko Juola 8de18bdc77 Add screenshot to README.md. 3 years ago
Mikko Juola a2e88c1193 Update README.md 3 years ago
Mikko Juola b4d5cf91a7 Mention in README.md that using OpenCL does not cast weights to 32-bit floats. 3 years ago
Mikko Juola 99da6ed71a Update README.md benchmarks for new attention OpenCL thing. 3 years ago
Mikko Juola 35b0c372a8 Implement some attention operations for OpenCL. 3 years ago
Mikko Juola 6e456e64f3 Add new benchmarks now that this is partially OpenCLified. 3 years ago
Mikko Juola 63d27dba90 Add partial OpenCL support, it's used in feed forward network only. 3 years ago
Mikko Juola df079bceb0 Add records of my benchmarks to README.md so I can compare it later. 3 years ago
Mikko Juola c9c861d199 Add some measurements so we can get tokens per second. 3 years ago
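A tokens-per-second measurement is just generated-token count over wall-clock time; a minimal sketch (names are illustrative, not the project's API) might be:

```rust
use std::time::Instant;

// tokens/sec = tokens generated / elapsed wall-clock seconds.
fn tokens_per_second(tokens_generated: usize, elapsed_secs: f64) -> f64 {
    tokens_generated as f64 / elapsed_secs
}

// Hypothetical wrapper around a generation loop.
fn timed_generate(n_tokens: usize) -> f64 {
    let start = Instant::now();
    // ... generate `n_tokens` tokens here ...
    // Guard against a zero-duration measurement on an empty loop.
    let elapsed = start.elapsed().as_secs_f64().max(f64::MIN_POSITIVE);
    tokens_per_second(n_tokens, elapsed)
}
```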
Mikko Juola 22792b26cc Add an idea about on-disk cache for initial prompt processing (not for weights). 3 years ago
Mikko Juola 9087c50efa Add notes about improving sampler to README.md 3 years ago
Mikko Juola 1a88482988 Add some OpenCL bits.
I wrote an OpenCL matrix_mul_inplace_transposed. On the GPU it is much
faster than my CPU implementation, and even when OpenCL runs on the CPU
(it can target both CPU and GPU) it is quite a lot faster than my own code.

Basically it can destroy all of my crappy code, so I think I will be
replacing some of my other operations with this stuff in the near future.
3 years ago
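For reference, a plain CPU version of what a transposed matrix multiply kernel computes is sketched below, assuming the common convention C = A·Bᵀ with B stored row-major (so both A and B are read sequentially, which is why the transposed variant is attractive). The function name and signature are illustrative, not the project's actual API.

```rust
// a: m x k, b: n x k (so B^T is k x n), c: m x n, all row-major.
fn matrix_mul_transposed(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Both a-row i and b-row j are walked contiguously.
                acc += a[i * k + p] * b[j * k + p];
            }
            c[i * n + j] = acc;
        }
    }
}
```

An OpenCL kernel would typically assign one (i, j) pair per work-item and keep the same inner dot product.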
Mikko Juola a92017bf56 Add some initial OpenCL stuff.
I can copy tensors to GPU and back but not much more. Maybe next time
I'll try implementing matrix_mul_transposed or something on the GPU.
3 years ago
Mikko Juola 53d367e6fa Add some beginnings of an OpenCL implementation.
I think I'll try to make the smaller modules run faster.
3 years ago
Mikko Juola 846759b277 Optimize conversions to and from f16<->f32.
x86 cannot do f16 arithmetic natively, but it does have instructions
to convert f16 to f32 and back. I optimized those conversions to use SIMD
instructions.
3 years ago
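The SIMD path here is presumably the x86 F16C extension (`vcvtph2ps` converts eight half-precision values to f32 at once). As a reference for what that instruction computes, here is a scalar f16-to-f32 decode by bit manipulation; it is a generic IEEE 754 binary16 decoder, not the project's actual code.

```rust
// Decode an IEEE 754 binary16 value (given as raw bits) to f32.
fn f16_to_f32(h: u16) -> f32 {
    let sign = (h >> 15) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    let bits = if exp == 0 {
        if frac == 0 {
            sign << 31 // signed zero
        } else {
            // Subnormal f16: renormalize into f32's larger exponent range.
            let mut e: i32 = 127 - 15 + 1;
            let mut f = frac;
            while f & 0x400 == 0 {
                f <<= 1;
                e -= 1;
            }
            f &= 0x3ff; // drop the now-implicit leading bit
            (sign << 31) | ((e as u32) << 23) | (f << 13)
        }
    } else if exp == 0x1f {
        // Infinity or NaN: all-ones exponent, payload widened.
        (sign << 31) | 0x7f80_0000 | (frac << 13)
    } else {
        // Normal number: rebias exponent from 15 to 127.
        (sign << 31) | ((exp + 127 - 15) << 23) | (frac << 13)
    };
    f32::from_bits(bits)
}
```

F16C performs exactly this widening in hardware, eight lanes per instruction, which is why batching the conversions pays off.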
Mikko Juola 8acb9f32b8 Update README.md for new discoveries. 3 years ago
Mikko Juola 26d5309cf7 Add support for bigger models.
I've tested with 13B LLaMA model and it seems to work.

There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code that assumed the buggy behavior; after fixing
the bug I removed some of that unpickling code.

I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
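For context, pickle protocol 2 has fixed-size tuple opcodes (TUPLE1 = 0x85, TUPLE2 = 0x86, TUPLE3 = 0x87), each of which pops its operands off the value stack and pushes the tuple back; a bug that "skips over tuples of size 1" would correspond to mishandling TUPLE1. A minimal sketch of correct handling, with `Value` as a stand-in for the project's actual value type:

```rust
// Illustrative value type; the project's unpickler has its own.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Tuple(Vec<Value>),
}

// Handle pickle protocol 2's fixed-size tuple opcodes:
// TUPLE1 = 0x85, TUPLE2 = 0x86, TUPLE3 = 0x87.
fn apply_tuple_opcode(stack: &mut Vec<Value>, opcode: u8) {
    let n = match opcode {
        0x85 => 1,
        0x86 => 2,
        0x87 => 3,
        _ => return, // other opcodes handled elsewhere
    };
    // Pop the last n values and push them back as one tuple.
    let items = stack.split_off(stack.len() - n);
    stack.push(Value::Tuple(items));
}
```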
Mikko Juola 8a427bcb21 The project is actually called rllama; put that in README.md. 3 years ago
Mikko Juola 18ef805458 Read parameters from model's JSON file instead of hard-coding them, make max sequence length configurable. 3 years ago
Mikko Juola f103871bc0 Make the output colored. This is essential to be taken seriously.
Also made some changes to keep clippy happy.
3 years ago
Mikko Juola cd28aba5e2 Make the output look nicer. 3 years ago
Mikko Juola d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file. 3 years ago
Mikko Juola 8bb9404168 Update README to clarify this is a Rust project and to show how to change temperature, top_k, top_p stuff. 3 years ago
Mikko Juola f6217e0036 Add readme, make clippy happy. 3 years ago
Mikko Juola 3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago