36 Commits (master)

Author SHA1 Message Date
Mikko Juola 26f343ad15 Add a flag that will exit the HTTP server after just one query.
This is for some experiments I want to run: killing the server gracefully
whenever I pull the logits out of it from a Python script.
3 years ago
Mikko Juola 957a8f9f98 Mention that `server` feature must be turned on to use the inference API. 3 years ago
Mikko Juola d85ed7f23e Mention HTTP server in features in README.md 3 years ago
Mikko Juola a8320613a1 Fix some things in README.md after proofreading it and removing lies. 3 years ago
Mikko Juola b9be485610 Add simple HTTP API support.
It took an annoying amount of effort just to make this simple server.

I tried the rouille web framework first, but it didn't support streaming
chunked output to the client line by line. (It seems that if it exposed
more details about the underlying tiny-http package, I could have hacked
it to work.)

I went with Rocket because it had less async stuff and seemed decent.

I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made
it use so much memory; even tools like valgrind and heaptrack told me
there wasn't that much memory allocated, yet I could see RES increasing
in `htop`.

Switched to MiMalloc as it seems to slightly decrease memory use.

Added details about the inference server to README.md, and also added
an example Python script that uses it.

I want to use this feature later to investigate how much quantization
or f16/f32 affects output. Such things are easier to do in Python.
3 years ago
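For reference, routing a Rust binary's allocations through MiMalloc is a one-line change with the `mimalloc` crate; a minimal sketch of the usual opt-in (the commit doesn't show its exact code):

```rust
// Cargo.toml: mimalloc = "0.1"
use mimalloc::MiMalloc;

// Route every heap allocation in the binary through MiMalloc
// instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u32> = (0..1024).collect(); // allocated via MiMalloc
    println!("{}", v.len());
}
```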
Mikko Juola 25e3e12d9d Update README.md on LLaMA-65B benchmark result. 3 years ago
Mikko Juola f233f8ad8f Forgot to mark the last benchmark as being from March 17. 3 years ago
Mikko Juola db0f22ed26 Update README.md, add a nice animation. 3 years ago
Mikko Juola 016b609481 More install instructions. 3 years ago
Mikko Juola 2666571e2b Update README.md to show `rllama` is on crates.io now. 3 years ago
Mikko Juola 109171b50e Mention that this is AMD64 only because of AVX2. 3 years ago
Mikko Juola 44e0abf0f1 Clarify that the OpenCL implementations all use f16. 3 years ago
Mikko Juola 58463458ee Put benchmarks on top of README.md. 3 years ago
Mikko Juola 882ff05254 Update README.md for new benchmarks. 3 years ago
Mikko Juola 8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way, but right now it looks like the most memory-friendly one. 3 years ago
Mikko Juola 09f76dfcfa Update README.md opening with new benchmark numbers. 3 years ago
Mikko Juola 4b8accee44 Update benchmarks. 3 years ago
Mikko Juola 862d4a15d6 Add repetition penalty, add colors to outputs based on probabilities, try to make softmax() more numerically stable. 3 years ago
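The usual numerical-stability fix for softmax() is to subtract the maximum logit before exponentiating, which keeps exp() from overflowing without changing the result; a minimal sketch (not necessarily the exact change in this commit):

```rust
/// Numerically stable softmax: exp(x - max) / sum(exp(x - max)).
/// Subtracting the max keeps exp() from overflowing on large logits,
/// and the output is unchanged because the exp(-max) factor cancels.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```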
Mikko Juola f4629ca987 Respect the stop token from the model. 3 years ago
Mikko Juola 687bbf1249 Add instructions on how to use OpenCL in the README.md 3 years ago
Mikko Juola 8de18bdc77 Add screenshot to README.md. 3 years ago
Mikko Juola a2e88c1193 Update README.md 3 years ago
Mikko Juola b4d5cf91a7 Mention in README.md that using OpenCL does not cast weights to 32-bit floats. 3 years ago
Mikko Juola 99da6ed71a Update README.md benchmarks for new attention OpenCL thing. 3 years ago
Mikko Juola 6e456e64f3 Add new benchmarks now that this is partially OpenCLified. 3 years ago
Mikko Juola df079bceb0 Add records of my benchmarks to README.md so I can compare it later. 3 years ago
Mikko Juola 22792b26cc Add an idea about on-disk cache for initial prompt processing (not for weights). 3 years ago
Mikko Juola 9087c50efa Add notes about improving sampler to README.md 3 years ago
Mikko Juola 1a88482988 Add some OpenCL bits.
I wrote an OpenCL matrix_mul_inplace_transposed. On GPU it is much
faster than my CPU implementation, and even on CPU (OpenCL runs on both
CPU and GPU) it is quite a lot faster than my own implementation.

Basically it outperforms all of my crappy code, so I think I will be
replacing some of my other operations with it in the near future.
3 years ago
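Assuming `matrix_mul_inplace_transposed` computes `out = a * bᵀ` with both operands row-major (the name suggests that, but the signature below is hypothetical), the layout is attractive because both innermost loops walk contiguous memory; a naive CPU reference:

```rust
/// Naive reference for a transposed matmul: out = a * b^T.
/// a is m x k, b is n x k (so b^T is k x n), out is m x n, all row-major.
/// Indexing b by rows keeps both innermost accesses contiguous, which is
/// also what makes this layout friendly to GPU kernels.
fn matmul_transposed(a: &[f32], b: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b[j * k + l];
            }
            out[i * n + j] = acc;
        }
    }
}
```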
Mikko Juola 8acb9f32b8 Update README.md for new discoveries. 3 years ago
Mikko Juola 26d5309cf7 Add support for bigger models.
I've tested with 13B LLaMA model and it seems to work.

There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no bug; I fixed the bug
and removed some of the unpickling code.

I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
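In the pickle protocol, one-element tuples have their own opcode, TUPLE1 (0x85), which pops a single value and pushes a 1-tuple; an unpickler that skips it silently loses that value. A hypothetical fragment of such an opcode loop (not the repo's actual code):

```rust
// Pickle protocol 2 opcode for a one-element tuple.
const TUPLE1: u8 = 0x85;

enum Value {
    Tuple(Vec<Value>),
    // ... other pickle value kinds ...
}

fn handle_opcode(op: u8, stack: &mut Vec<Value>) {
    match op {
        // TUPLE1 pops one value and pushes a 1-tuple wrapping it.
        // Skipping this opcode (the bug described above) desynchronizes
        // the unpickler's stack from the real pickle stream.
        TUPLE1 => {
            let item = stack.pop().expect("TUPLE1 needs one stack item");
            stack.push(Value::Tuple(vec![item]));
        }
        _ => unimplemented!("other opcodes elided"),
    }
}
```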
Mikko Juola 8a427bcb21 The project is actually called rllama, put that in readme.md. 3 years ago
Mikko Juola d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file. 3 years ago
Mikko Juola 8bb9404168 Update README to clarify this is a Rust project and to show how to change temperature, top_k, top_p stuff. 3 years ago
Mikko Juola f6217e0036 Add readme, make clippy happy. 3 years ago
Mikko Juola 3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago