36 Commits (master)

Author SHA1 Message Date
Mikko Juola 26f343ad15 Add a flag that will exit the HTTP server after just one query.
This is for some experiments I want to run: killing the server gracefully
whenever I pull the logits out of it from a Python script.
3 years ago
Mikko Juola 957a8f9f98 Mention that `server` feature must be turned on to use the inference API. 3 years ago
Mikko Juola d85ed7f23e Mention HTTP server in features in README.md 3 years ago
Mikko Juola a8320613a1 Fix some things in README.md after proofreading it and removing lies. 3 years ago
Mikko Juola b9be485610 Add simple HTTP API support.
It took an annoying amount of effort just to make this simple server.

I tried the rouille web framework first, but it didn't support streaming
chunked output to the client line by line. (It seems that if it exposed
more details about the underlying tiny-http package, I could have hacked
it to work.)

I went with Rocket because it had less async stuff and seemed decent.

I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made
it use so much memory; even tools like valgrind and heaptrack told me
there wasn't that much memory allocated, yet I could see RES increasing
in `htop`.

Switched to MiMalloc as it seems to slightly decrease memory use.

Added details about the inference server to README.md, and also added
an example Python script that uses it.

I want to use this feature later to investigate how much quantization
or f16/f32 affects output. Such things are easier to do in Python.
3 years ago
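For reference, routing a Rust binary's allocations through MiMalloc is a one-line change with the `mimalloc` crate; a minimal sketch of the usual opt-in (the commit doesn't show its exact code):

```rust
// Cargo.toml: mimalloc = "0.1"
use mimalloc::MiMalloc;

// Route every heap allocation in the binary through MiMalloc
// instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u32> = (0..1024).collect(); // allocated via MiMalloc
    println!("{}", v.len());
}
```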
Mikko Juola 25e3e12d9d Update README.md on LLaMA-65B benchmark result. 3 years ago
Mikko Juola f233f8ad8f Forgot to mark the last benchmark as being from March 17. 3 years ago
Mikko Juola db0f22ed26 Update README.md, add a nice animation. 3 years ago
Mikko Juola 016b609481 More install instructions. 3 years ago
Mikko Juola 2666571e2b Update README.md to show `rllama` is on crates.io now. 3 years ago
Mikko Juola 109171b50e Mention that this is AMD64 only because of AVX2. 3 years ago
Mikko Juola 44e0abf0f1 Clarify that the OpenCL implementations all use f16. 3 years ago
Mikko Juola 58463458ee Put benchmarks on top of README.md. 3 years ago
Mikko Juola 882ff05254 Update README.md for new benchmarks. 3 years ago
Mikko Juola 8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way, but right now it looks like the most memory-friendly one. 3 years ago
Mikko Juola 09f76dfcfa Update README.md opening with new benchmark numbers. 3 years ago
Mikko Juola 4b8accee44 Update benchmarks. 3 years ago
Mikko Juola 862d4a15d6 Add repetition penalty, add colors to outputs based on probabilities, try to make softmax() more numerically stable. 3 years ago
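The usual numerical-stability fix for softmax() is to subtract the maximum logit before exponentiating, which keeps exp() from overflowing without changing the result; a minimal sketch (not necessarily the exact change in this commit):

```rust
/// Numerically stable softmax: exp(x - max) / sum(exp(x - max)).
/// Subtracting the max keeps exp() from overflowing on large logits,
/// and the output is unchanged because the exp(-max) factor cancels.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```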
Mikko Juola f4629ca987 Respect the stop token from the model. 3 years ago
Mikko Juola 687bbf1249 Add instructions on how to use OpenCL in the README.md 3 years ago
Mikko Juola 8de18bdc77 Add screenshot to README.md. 3 years ago
Mikko Juola a2e88c1193 Update README.md 3 years ago
Mikko Juola b4d5cf91a7 Mention in README.md that using OpenCL does not cast weights to 32-bit floats. 3 years ago
Mikko Juola 99da6ed71a Update README.md benchmarks for new attention OpenCL thing. 3 years ago
Mikko Juola 6e456e64f3 Add new benchmarks now that this is partially OpenCLified. 3 years ago
Mikko Juola df079bceb0 Add records of my benchmarks to README.md so I can compare it later. 3 years ago
Mikko Juola 22792b26cc Add an idea about on-disk cache for initial prompt processing (not for weights). 3 years ago
Mikko Juola 9087c50efa Add notes about improving sampler to README.md 3 years ago
Mikko Juola 1a88482988 Add some OpenCL bits.
I wrote an OpenCL matrix_mul_inplace_transposed. On GPU it is much
faster than my CPU implementation, and even on CPU (OpenCL runs on both
CPU and GPU) it is quite a lot faster than my own implementation.

Basically it outperforms all of my crappy code, so I think I will be
replacing some of my other operations with it in the near future.
3 years ago
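Assuming `matrix_mul_inplace_transposed` computes `out = a * bᵀ` with both operands row-major (the name suggests that, but the signature below is hypothetical), the layout is attractive because both innermost loops walk contiguous memory; a naive CPU reference:

```rust
/// Naive reference for a transposed matmul: out = a * b^T.
/// a is m x k, b is n x k (so b^T is k x n), out is m x n, all row-major.
/// Indexing b by rows keeps both innermost accesses contiguous, which is
/// also what makes this layout friendly to GPU kernels.
fn matmul_transposed(a: &[f32], b: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b[j * k + l];
            }
            out[i * n + j] = acc;
        }
    }
}
```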
Mikko Juola 8acb9f32b8 Update README.md for new discoveries. 3 years ago
Mikko Juola 26d5309cf7 Add support for bigger models.
I've tested with 13B LLaMA model and it seems to work.

There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no bug; I fixed the bug
and removed some of the unpickling code.

I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
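In the pickle protocol, one-element tuples have their own opcode, TUPLE1 (0x85), which pops a single value and pushes a 1-tuple; an unpickler that skips it silently loses that value. A hypothetical fragment of such an opcode loop (not the repo's actual code):

```rust
// Pickle protocol 2 opcode for a one-element tuple.
const TUPLE1: u8 = 0x85;

enum Value {
    Tuple(Vec<Value>),
    // ... other pickle value kinds ...
}

fn handle_opcode(op: u8, stack: &mut Vec<Value>) {
    match op {
        // TUPLE1 pops one value and pushes a 1-tuple wrapping it.
        // Skipping this opcode (the bug described above) desynchronizes
        // the unpickler's stack from the real pickle stream.
        TUPLE1 => {
            let item = stack.pop().expect("TUPLE1 needs one stack item");
            stack.push(Value::Tuple(vec![item]));
        }
        _ => unimplemented!("other opcodes elided"),
    }
}
```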
Mikko Juola 8a427bcb21 The project is actually called rllama, put that in readme.md. 3 years ago
Mikko Juola d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file. 3 years ago
Mikko Juola 8bb9404168 Update README to clarify this is a Rust project and to show how to change temperature, top_k, top_p stuff. 3 years ago
Mikko Juola f6217e0036 Add readme, make clippy happy. 3 years ago
Mikko Juola 3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago