As of this commit, the test works. But I want to optimize this a bit, to see
whether increasing the ratio of load instructions to arithmetic instructions
will make single-threaded performance a bit speedier.
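One kind of change that tilts the instruction mix that way (a hypothetical sketch, not the code from this commit) is keeping several vector accumulators live so the hot loop is almost nothing but loads and FMAs, with the horizontal reduction deferred to the very end:

```rust
// Hypothetical sketch, not the tensor.rs code: an AVX2/FMA dot product
// whose inner loop is only loads + FMAs (4 loads, 2 FMAs per iteration),
// with a single horizontal reduction outside the loop. Requires building
// with `-C target-feature=+avx2,+fma` on x86_64.
#[cfg(target_arch = "x86_64")]
unsafe fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 16, 0);
    let mut acc0 = _mm256_setzero_ps();
    let mut acc1 = _mm256_setzero_ps();
    let mut i = 0;
    while i < a.len() {
        acc0 = _mm256_fmadd_ps(
            _mm256_loadu_ps(a.as_ptr().add(i)),
            _mm256_loadu_ps(b.as_ptr().add(i)),
            acc0,
        );
        acc1 = _mm256_fmadd_ps(
            _mm256_loadu_ps(a.as_ptr().add(i + 8)),
            _mm256_loadu_ps(b.as_ptr().add(i + 8)),
            acc1,
        );
        i += 16;
    }
    // Single horizontal reduction, kept out of the hot loop.
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(acc0, acc1));
    out.iter().sum()
}
```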
It took an annoying amount of effort just to make this simple server.
I tried the rouille web framework first, but it didn't support sending
chunked output to the client line by line. (It seems that if it exposed
more details about the underlying tiny-http crate, I could have hacked
it to work.)
I went with Rocket because it had less async stuff and seemed decent.
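For reference, here is a minimal sketch of the kind of line-by-line chunked response Rocket 0.5 allows through its TextStream responder; the route and body are illustrative, not the actual server code:

```rust
// Minimal sketch, assuming Rocket 0.5: each yielded string goes out as part
// of a chunked response body, which is the line-by-line streaming rouille
// didn't expose. Route name and contents are made up for illustration.
#[macro_use] extern crate rocket;
use rocket::response::stream::TextStream;

#[get("/generate")]
fn generate() -> TextStream![String] {
    TextStream! {
        for i in 0..5 {
            // In the real server each yield would be one generated line/token.
            yield format!("token {}\n", i);
        }
    }
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![generate])
}
```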
I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made it
use so much memory: even tools like valgrind and heaptrack told me there
isn't that much memory allocated, yet I could see RES growing in `htop`.
I switched to MiMalloc, as it seems to slightly decrease memory use.
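The switch itself is just the standard global-allocator declaration from the mimalloc crate:

```rust
// Standard pattern for the `mimalloc` crate: replace the global allocator
// for the whole binary.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```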
Added details about the inference server to README.md, and also added an
example Python script that uses it.
I want to use this feature later to investigate how much quantization or
f16/f32 affects the output. It's easier to do such things in Python.
This should make it a bit easier to port to other SIMD instruction sets,
now that the SIMD instructions are not littered randomly around the
tensor.rs file.
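Roughly the idea, as a sketch with illustrative names (not the actual tensor.rs interface): keep every intrinsic behind one small module with a fixed surface, so a port to another instruction set only adds another cfg branch:

```rust
// Hypothetical sketch: one tiny module owns all the intrinsics, and the
// rest of the code only calls load/fma/zero/sum. Names are illustrative.
#[cfg(target_arch = "x86_64")]
pub mod simd {
    use core::arch::x86_64::*;
    pub type F32Vec = __m256;
    pub const LANES: usize = 8;

    #[inline(always)]
    pub unsafe fn load(ptr: *const f32) -> F32Vec { _mm256_loadu_ps(ptr) }
    #[inline(always)]
    pub unsafe fn fma(a: F32Vec, b: F32Vec, acc: F32Vec) -> F32Vec {
        _mm256_fmadd_ps(a, b, acc)
    }
    #[inline(always)]
    pub unsafe fn zero() -> F32Vec { _mm256_setzero_ps() }
    #[inline(always)]
    pub unsafe fn sum(v: F32Vec) -> f32 {
        let mut out = [0.0f32; LANES];
        _mm256_storeu_ps(out.as_mut_ptr(), v);
        out.iter().sum()
    }
}

#[cfg(target_arch = "aarch64")]
pub mod simd {
    use core::arch::aarch64::*;
    pub type F32Vec = float32x4_t;
    pub const LANES: usize = 4;

    #[inline(always)]
    pub unsafe fn load(ptr: *const f32) -> F32Vec { vld1q_f32(ptr) }
    #[inline(always)]
    pub unsafe fn fma(a: F32Vec, b: F32Vec, acc: F32Vec) -> F32Vec {
        vfmaq_f32(acc, a, b)
    }
    #[inline(always)]
    pub unsafe fn zero() -> F32Vec { vdupq_n_f32(0.0) }
    #[inline(always)]
    pub unsafe fn sum(v: F32Vec) -> f32 { vaddvq_f32(v) }
}
```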
In benchmarks it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet, though, and I need to implement some
other ops before that starts working.
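Conceptually the f16 path is about halving memory traffic while still accumulating in f32. A hedged sketch of that idea using the half crate, which may not match the internal representation here:

```rust
// Hedged sketch: f16 storage, f32 arithmetic. Uses the `half` crate;
// not necessarily the same representation as tensor.rs uses internally.
use half::f16;

fn dot_f16(a: &[f16], b: &[f16]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| x.to_f32() * y.to_f32())
        .sum()
}
```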
This is something like ~10 times faster than the old one. But
surprisingly it didn't have much impact on text generation time; maybe
most of the remaining slowness no longer comes from matrix multiplication.
This also slowed down the CPU implementation. I think I'll try adding
another kernel later for running OpenCL on the CPU.
I wrote an OpenCL matrix_mul_inplace_transposed. On the GPU it is much
faster than my CPU implementation, and even with OpenCL running on the
CPU (OpenCL runs on both CPU and GPU) it is quite a lot faster than my
own implementation.
Basically it destroys all of my crappy code, so I think I will be
replacing some of my other operations with OpenCL in the near future.
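For clarity, this is the operation I mean by "transposed", shown as a plain Rust reference version with illustrative names and layout (not the actual tensor.rs or kernel code): the right-hand matrix is passed pre-transposed so both operands are read contiguously:

```rust
// Hypothetical reference version of matrix_mul_inplace_transposed:
// out[i][j] = sum_k a[i][k] * b[j][k], i.e. `b` is stored as B^T so the
// inner loop walks both inputs row-major. Names/layout are illustrative.
fn matrix_mul_inplace_transposed(
    out: &mut [f32], // rows_a x rows_b, row-major
    a: &[f32],       // rows_a x cols, row-major
    b: &[f32],       // rows_b x cols, row-major (the transposed operand)
    rows_a: usize,
    rows_b: usize,
    cols: usize,
) {
    for i in 0..rows_a {
        for j in 0..rows_b {
            let mut acc = 0.0f32;
            for k in 0..cols {
                acc += a[i * cols + k] * b[j * cols + k];
            }
            out[i * rows_b + j] = acc;
        }
    }
}
```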
I've tested with the 13B LLaMA model and it seems to work.
There was a bug in the unpickler that skipped over tuples of size 1. I had
written a bunch of code assuming there was no such bug; I fixed the bug
and removed some of the unpickling code.
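For context, a hedged sketch of the opcode in question (types and names are hypothetical, not the real unpickler's): pickle protocol 2 has a dedicated TUPLE1 opcode (0x85) that pops exactly one stack value and pushes a 1-element tuple, so mishandling it silently drops every size-1 tuple in the file:

```rust
// Hypothetical sketch of correct TUPLE1 handling (names illustrative).
#[derive(Debug, Clone)]
enum Value {
    Tuple(Vec<Value>),
    Int(i64),
    // ... other pickle value kinds elided
}

fn handle_opcode(op: u8, stack: &mut Vec<Value>) {
    match op {
        // TUPLE1 (0x85): wrap the top of the stack in a one-element tuple.
        0x85 => {
            let v = stack.pop().expect("TUPLE1 on empty stack");
            stack.push(Value::Tuple(vec![v]));
        }
        _ => { /* other opcodes elided */ }
    }
}
```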
I added functions to tensor.rs to be able to construct tensors out of
multiple files.
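Roughly what that means in practice (a sketch with made-up names, not the actual tensor.rs API): the sharded 13B+ checkpoints keep pieces of the same tensor in separate files, and the pieces get concatenated along one axis when the tensor is built:

```rust
// Hypothetical sketch: concatenate row-sharded pieces of one tensor.
// Each shard is (rows, cols, row-major data); all shards share `cols`.
fn concat_rows(shards: &[(usize, usize, Vec<f32>)]) -> (usize, usize, Vec<f32>) {
    let cols = shards[0].1;
    assert!(shards.iter().all(|s| s.1 == cols));
    let rows: usize = shards.iter().map(|s| s.0).sum();
    let mut data = Vec::with_capacity(rows * cols);
    for (_, _, shard) in shards {
        data.extend_from_slice(shard);
    }
    (rows, cols, data)
}
```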