13 Commits (b9be485610edfbb3bfc97633bfa72c436f7a5516)

Author SHA1 Message Date
Mikko Juola b9be485610 Add simple HTTP API support.
It took an annoying amount of effort just to make this simple server.

I tried the rouille web framework first, but it didn't support sending
chunked output to the client line-by-line. (It seems that if it exposed
more details about the underlying tiny-http crate, I could have hacked
it to work.)

I went with Rocket because it had less async stuff and seemed decent.
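
As an illustration of the line-by-line streaming that rouille was missing,
Rocket 0.5's stream responders can express it directly. A minimal sketch
(the route and the token loop are made up for illustration, not this
commit's actual server code):

```rust
#[macro_use] extern crate rocket;
use rocket::response::stream::TextStream;
use rocket::tokio::time::{sleep, Duration};

// Stream text to the client chunk-by-chunk as it is produced.
#[get("/generate")]
fn generate() -> TextStream![String] {
    TextStream! {
        for i in 0..3 {
            yield format!("token {}\n", i);
            sleep(Duration::from_millis(100)).await; // stand-in for inference work
        }
    }
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![generate])
}
```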

I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made
it use so much memory; even tools like valgrind and heaptrack told me
there wasn't that much memory allocated, yet I could see RES increasing
in `htop`.

Switched to MiMalloc as it seems to slightly decrease memory use.
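
Opting into MiMalloc in a Rust binary is a small change; a minimal sketch
using the mimalloc crate's global-allocator hookup:

```rust
use mimalloc::MiMalloc;

// Route every heap allocation in the binary through mimalloc
// instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```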

Added details about the inference server to README.md, along with an
example Python script that uses it.

I want to use this feature later to investigate how much quantization
or f16/f32 affects the output. It's easier to do such things in Python.
3 years ago
Mikko Juola 3d0afcf243 Make matrix multiplication multithreaded.
This improves performance greatly with f16. It's faster now than OpenCL
on LLaMA-7B.
3 years ago
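The kernel itself isn't shown in this listing, but the general shape of a
row-parallel matmul in Rust can be sketched with rayon (an illustrative
version with made-up names, not the repo's actual implementation):

```rust
use rayon::prelude::*;

// c = a * b, all row-major f32: a is m x k, b is k x n, c is m x n.
// Each output row is independent, so rows parallelize cleanly across threads.
fn matmul_par(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    c.par_chunks_mut(n).enumerate().for_each(|(i, row)| {
        for (j, out) in row.iter_mut().enumerate() {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b[l * n + j];
            }
            *out = acc;
        }
    });
}
```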
Mikko Juola 8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way, but right now it looks like the most memory-friendly one. 3 years ago
Mikko Juola a1970b8a9c Improve transposed matrix multiplication further; this gives around a 10%-20% additional speedup by improving the memory-load-to-instruction ratio. 3 years ago
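For reference, the point of multiplying against a pre-transposed right-hand
side is that both inner-loop reads become sequential, so each memory load
feeds more arithmetic. A hypothetical scalar sketch (not the repo's
optimized kernel):

```rust
// c = a * b where b_t is b stored transposed (n x k, row-major).
// Both a and b_t are read contiguously in the innermost loop, which is
// friendlier to caches and hardware prefetching than striding through b.
fn matmul_transposed(a: &[f32], b_t: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b_t[j * k + l];
            }
            c[i * n + j] = acc;
        }
    }
}
```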
Mikko Juola 8aef5d8831 Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make the _inplace naming consistent. 3 years ago
Mikko Juola 35b0c372a8 Implement some attention operations for OpenCL. 3 years ago
Mikko Juola 63d27dba90 Add partial OpenCL support, it's used in feed forward network only. 3 years ago
Mikko Juola 26d5309cf7 Add support for bigger models.
I've tested with the 13B LLaMA model and it seems to work.

There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no bug; I fixed the bug
and removed some unpickling code.
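
For context on why size-1 tuples are easy to miss: in pickle protocol 2
and later, a one-element tuple gets its own TUPLE1 opcode instead of the
generic MARK ... TUPLE sequence. A hypothetical sketch of the handling
(the Value type and function names are illustrative, not this repo's
unpickler):

```rust
#[derive(Debug)]
enum Value {
    Int(i64),
    Tuple(Vec<Value>),
}

// Pickle protocol 2 opcodes for small tuples (see Python's pickletools).
const TUPLE1: u8 = 0x85;
const TUPLE2: u8 = 0x86;
const TUPLE3: u8 = 0x87;

// Pop the top n stack items and push them back as a single tuple.
fn handle_tuple_opcode(op: u8, stack: &mut Vec<Value>) {
    let n = match op {
        TUPLE1 => 1,
        TUPLE2 => 2,
        TUPLE3 => 3,
        _ => return, // everything else is handled elsewhere
    };
    let items = stack.split_off(stack.len() - n);
    stack.push(Value::Tuple(items));
}

fn main() {
    let mut stack = vec![Value::Int(7)];
    handle_tuple_opcode(TUPLE1, &mut stack);
    println!("{:?}", stack); // [Tuple([Int(7)])]
}
```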

I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
Mikko Juola 18ef805458 Read parameters from the model's JSON file instead of hard-coding them; make the max sequence length configurable. 3 years ago
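Reading those parameters with serde might look like this (a hypothetical
sketch; the field names follow LLaMA's params.json layout, and the struct
and function names are made up):

```rust
use serde::Deserialize;

// Mirrors the fields found in LLaMA's params.json.
#[derive(Debug, Deserialize)]
struct ModelParams {
    dim: usize,
    n_heads: usize,
    n_layers: usize,
    norm_eps: f64,
}

fn load_params(path: &str) -> Result<ModelParams, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&text)?)
}
```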
Mikko Juola f103871bc0 Make the output colored. This is essential for being taken seriously.
Also made some changes to keep clippy happy.
3 years ago
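One common way to get colored terminal output with no dependencies is raw
ANSI escape codes; a minimal sketch (not necessarily how this commit does it):

```rust
// \x1b[32m switches the terminal to green, \x1b[0m resets it.
fn main() {
    println!("\x1b[32m{}\x1b[0m", "generated text, now in green");
}
```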
Mikko Juola d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file. 3 years ago
Mikko Juola f6217e0036 Add readme, make clippy happy. 3 years ago
Mikko Juola 3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago