rllama

Commit Graph

Author	SHA1	Message	Date
Mikko Juola	b9be485610	Add simple HTTP API support. It took an annoying amount of effort just to make this simple server. I tried the rouille web framework first, but it didn't support sending chunked output to the client line by line (it seems that if it exposed more details about the underlying tiny-http package, I could have hacked it to work). I went with Rocket because it had less async stuff and seemed decent. I ran into weird issues where memory use seemed to keep increasing. I may have fixed that, but I couldn't figure out what made it use so much memory; even tools like valgrind and heaptrack told me there isn't that much memory allocated, yet I can see RES increasing in `htop`. Switched to MiMalloc as it seems to slightly decrease memory use. Added details about the inference server to README.md, and also added an example Python script that uses it. I want to use this feature later to investigate how much quantization or f16/f32 affects output. Such things are easier to do in Python.	3 years ago
Mikko Juola	db0f22ed26	Update README.md, add a nice animation.	3 years ago
Mikko Juola	f2c38a272f	Update Cargo dependencies.	3 years ago
Mikko Juola	53d367e6fa	Add some beginnings of an OpenCL implementation. I think I'll try to get the smaller modules to run faster.	3 years ago
Mikko Juola	18ef805458	Read parameters from model's JSON file instead of hard-coding them, make max sequence length configurable.	3 years ago
Mikko Juola	cd28aba5e2	Make the output look nicer.	3 years ago
Mikko Juola	3b8f904f13	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago

7 Commits (26f343ad1599aafc51ff68f72e493a859c6b29dd)