It took an annoying amount of effort just to make this simple server.
I tried the rouille web framework first, but it didn't support streaming
chunked output to the client line-by-line. (It seems that if it exposed
more details about the underlying tiny-http package, I could have hacked
it to work.)
I went with Rocket because it had less async stuff and seemed decent.
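For reference, here is a minimal sketch of the kind of line-by-line streaming endpoint I mean, assuming Rocket 0.5's `TextStream`; the `/generate` route and the dummy token loop are placeholders, not the real server code.

```rust
#[macro_use] extern crate rocket;

use rocket::response::stream::TextStream;
use rocket::tokio::time::{sleep, Duration};

// Streams text to the client one line at a time instead of buffering
// the whole response. In the real server each yielded line would be
// generated model output.
#[get("/generate")]
fn generate() -> TextStream![String] {
    TextStream! {
        for i in 0..5 {
            yield format!("token {}\n", i);
            sleep(Duration::from_millis(100)).await;
        }
    }
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![generate])
}
```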
I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made it
use so much memory; even tools like valgrind and heaptrack told me there
isn't that much memory allocated, yet I can see RES increasing in `htop`.
Switched to MiMalloc as it seems to slightly decrease memory use.
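Swapping the global allocator is a one-liner in Rust; a minimal sketch, assuming the `mimalloc` crate is added to Cargo.toml:

```rust
use mimalloc::MiMalloc;

// Route every heap allocation in the program through mimalloc instead
// of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} elements", v.len());
}
```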
Added details about the inference server to README.md, and also added an
example Python script for it.
I want to use this feature later to investigate how much quantization or
f16/f32 affects output. It's easier to do such things in Python.
I wrote an OpenCL matrix_mul_inplace_transposed. On GPU it is much faster
than my CPU implementation, and even running OpenCL on the CPU (OpenCL runs
on both CPU and GPU) it is quite a lot faster than my own code.
Basically it destroys all of my crappy code, so I think I will be replacing
some of my other operations with it in the near future.
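For context, this is roughly what the operation computes; a plain CPU sketch of the semantics, assuming row-major f32 buffers (the real code's layout and types may differ):

```rust
// Reference for what matrix_mul_inplace_transposed computes:
// out = a * b^T, written in place into `out`.
// Shapes: a is (m x k), b is (n x k), out is (m x n), all row-major.
fn matrix_mul_inplace_transposed(out: &mut [f32], a: &[f32], b: &[f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), n * k);
    assert_eq!(out.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                // b[j][l] is the transposed access; both rows read here
                // are contiguous in memory.
                acc += a[i * k + l] * b[j * k + l];
            }
            out[i * n + j] = acc;
        }
    }
}

fn main() {
    // 2x3 times (2x3)^T -> 2x2
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0];
    let mut out = [0.0f32; 4];
    matrix_mul_inplace_transposed(&mut out, &a, &b, 2, 3, 2);
    println!("{:?}", out); // [4.0, 2.0, 10.0, 5.0]
}
```

Keeping the right-hand matrix transposed means both inner-loop reads walk contiguous memory, which is also why the same layout maps nicely onto an OpenCL kernel.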
I've tested with the 13B LLaMA model and it seems to work.
There was a bug in the unpickler that skipped over tuples of size 1. I had
written a bunch of code assuming there was no bug; I fixed the bug and
removed some unpickling code.
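For context, TUPLE1 is the pickle protocol 2 opcode that pops exactly one item off the stack and wraps it in a one-element tuple. A hedged sketch of what handling it looks like in a toy stack-machine unpickler (the `Value` type and function here are illustrative, not the real code):

```rust
#[derive(Debug)]
enum Value {
    Int(i64),
    Tuple(Vec<Value>),
}

// TUPLE1 (opcode 0x85): pop one item, push a 1-element tuple containing it.
// Skipping this opcode leaves the bare item on the stack, which confuses
// any code that expects a tuple there.
fn handle_tuple1(stack: &mut Vec<Value>) -> Result<(), String> {
    let item = stack.pop().ok_or("stack underflow in TUPLE1")?;
    stack.push(Value::Tuple(vec![item]));
    Ok(())
}

fn main() {
    let mut stack = vec![Value::Int(42)];
    handle_tuple1(&mut stack).unwrap();
    println!("{:?}", stack); // [Tuple([Int(42)])]
}
```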
I added functions to tensor.rs to be able to construct tensors out of
multiple files.
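Roughly the idea, as a hypothetical sketch: each shard file holds a slice of rows and the pieces get concatenated into one buffer. The raw little-endian f32 format, the file names, and the `load_rows` helper are made up for illustration; the real functions live in tensor.rs.

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Load `rows * cols` little-endian f32 values from one shard file.
fn load_rows(path: &str, rows: usize, cols: usize) -> std::io::Result<Vec<f32>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut bytes = vec![0u8; rows * cols * 4];
    reader.read_exact(&mut bytes)?;
    Ok(bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}

// Concatenate shards along the row dimension into one flat row-major buffer.
fn tensor_from_files(paths: &[&str], rows_per_shard: usize, cols: usize) -> std::io::Result<Vec<f32>> {
    let mut data = Vec::with_capacity(paths.len() * rows_per_shard * cols);
    for path in paths {
        data.extend(load_rows(path, rows_per_shard, cols)?);
    }
    Ok(data)
}

fn main() -> std::io::Result<()> {
    // Illustrative shard names and shapes only.
    let full = tensor_from_files(&["shard0.bin", "shard1.bin"], 1024, 4096)?;
    println!("loaded {} values", full.len());
    Ok(())
}
```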