It took an annoying amount of effort just to get this simple server working.
I tried the rouille web framework first, but it didn't support streaming
chunked output to the client line by line. (It seems that if it exposed
more details about the underlying tiny-http crate, I could have hacked it
to work.)
I went with Rocket because it had less async stuff and seemed decent.
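Rocket's response streams are what make the line-by-line output possible. A minimal sketch of the idea, not the server's actual handler; the route name and the fake token loop are placeholders:

```rust
#[macro_use]
extern crate rocket;

use rocket::response::stream::TextStream;
use rocket::tokio::time::{sleep, Duration};

// Placeholder route; the real server's endpoints are described in README.md.
#[get("/generate")]
fn generate() -> TextStream![String] {
    TextStream! {
        // Stand-in for the real inference loop: each yield becomes a new
        // piece of the chunked response, so the client can read the output
        // line by line as it is produced.
        for i in 0..5u32 {
            yield format!("token {}\n", i);
            sleep(Duration::from_millis(100)).await;
        }
    }
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![generate])
}
```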
I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made it
use so much memory: even tools like valgrind and heaptrack told me there
wasn't that much memory allocated, yet I could see RES climbing in `htop`.
Switched to MiMalloc, as it seems to slightly decrease memory use.
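Swapping the global allocator in Rust is only a couple of lines. A sketch of the setup, assuming the `mimalloc` crate is listed in Cargo.toml:

```rust
use mimalloc::MiMalloc;

// Route all heap allocations in this binary through MiMalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Every allocation from here on uses MiMalloc.
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} elements", v.len());
}
```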
Added details about the inference server to README.md, along with an
example Python script that uses it.
I want to use this feature later to investigate how much quantization or
the choice of f16/f32 affects output. Such things are easier to do in
Python.
This should make it a bit easier to port to other SIMD instruction sets,
now that the SIMD instructions are no longer littered randomly around the
tensor.rs file.
In benchmarks it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet, though, and I need to implement some
other ops before that starts working.
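For reference, here is the general shape such an op can take (a sketch assuming the `half` crate for f16 storage, which is an assumption rather than necessarily what this repo does): keep the data in f16 but accumulate in f32 to limit rounding error.

```rust
use half::f16;

// Convert an f32 buffer to f16 for storage; each value is rounded to the
// nearest representable f16.
fn to_f16(src: &[f32]) -> Vec<f16> {
    src.iter().copied().map(f16::from_f32).collect()
}

// Dot product over f16 data, accumulating in f32 for precision.
fn dot_f16(a: &[f16], b: &[f16]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x.to_f32() * y.to_f32()).sum()
}

fn main() {
    let a = to_f16(&[1.0, 2.0, 3.0]);
    let b = to_f16(&[4.0, 5.0, 6.0]);
    println!("dot = {}", dot_f16(&a, &b)); // 32
}
```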
This is something like 10 times faster than the old one, but surprisingly
it didn't have much impact on text generation time. Maybe most of the
remaining slowness no longer comes from matrix multiplication.
This also slowed down the CPU implementation. I think I'll try adding
another kernel later for OpenCL on the CPU.
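One quick way to check where the time actually goes (a generic sketch, not code from this repo) is to accumulate wall-clock time per stage of a token step and compare the totals:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Accumulate wall-clock time per labelled stage across many token steps.
struct StageTimer {
    totals: HashMap<&'static str, Duration>,
}

impl StageTimer {
    fn new() -> Self {
        StageTimer { totals: HashMap::new() }
    }

    // Run `f`, charging its elapsed time to `label`.
    fn time<T>(&mut self, label: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        *self.totals.entry(label).or_default() += start.elapsed();
        out
    }

    fn report(&self) {
        for (label, total) in &self.totals {
            println!("{label}: {total:?}");
        }
    }
}

fn main() {
    let mut timer = StageTimer::new();
    for _ in 0..100 {
        // Stand-ins for the real stages of a token step.
        let _ = timer.time("matmul", || (0..10_000u64).sum::<u64>());
        let _ = timer.time("everything_else", || (0..10_000u64).product::<u64>());
    }
    timer.report();
}
```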
The tokenizer would misinterpret newlines. In general, non-printable
control characters don't seem to be tokenized correctly at the moment. I
added a band-aid for newlines but should probably fix the others too.
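For context, the LLaMA SentencePiece vocabulary carries a byte-fallback token for the newline byte (`<0x0A>`), so one band-aid of this general shape is possible (a sketch, not the repo's actual fix): split the input on newlines, tokenize each piece normally, and splice the newline token id back in between pieces. `tokenize` and `newline_token_id` below are stand-ins for the real tokenizer and vocabulary lookup.

```rust
// Band-aid sketch: handle '\n' outside the normal tokenizer pass.
fn tokenize_with_newline_fix(
    text: &str,
    tokenize: impl Fn(&str) -> Vec<u32>, // stand-in for the real tokenizer
    newline_token_id: u32,               // id of the vocabulary's <0x0A> token
) -> Vec<u32> {
    let mut ids = Vec::new();
    for (i, piece) in text.split('\n').enumerate() {
        if i > 0 {
            ids.push(newline_token_id);
        }
        if !piece.is_empty() {
            ids.extend(tokenize(piece));
        }
    }
    ids
}

fn main() {
    // Dummy tokenizer: one fake id per whitespace-separated word.
    let fake_tokenize = |s: &str| s.split_whitespace().map(|_| 1u32).collect::<Vec<_>>();
    // 13 stands in for the id of <0x0A> in the vocabulary.
    let ids = tokenize_with_newline_fix("hello world\nsecond line", fake_tokenize, 13);
    println!("{:?}", ids); // [1, 1, 13, 1, 1]
}
```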