13 Commits (b9be485610edfbb3bfc97633bfa72c436f7a5516)

Author SHA1 Message Date
Mikko Juola b9be485610 Add simple HTTP API support.
It took an annoying amount of effort just to make this simple server.

I tried the rouille web framework first, but it didn't support sending
chunked output to the client line-by-line. (It seems that if it exposed
more details about the underlying tiny-http crate, I could have hacked
it to work.)

I went with Rocket because it had less async stuff and seemed decent.
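
As an illustration of the line-by-line streaming that rouille was missing,
Rocket 0.5's stream responders can express it directly. A minimal sketch
(the route and the token loop are made up for illustration, not this
commit's actual server code):

```rust
#[macro_use] extern crate rocket;
use rocket::response::stream::TextStream;
use rocket::tokio::time::{sleep, Duration};

// Stream text to the client chunk-by-chunk as it is produced.
#[get("/generate")]
fn generate() -> TextStream![String] {
    TextStream! {
        for i in 0..3 {
            yield format!("token {}\n", i);
            sleep(Duration::from_millis(100)).await; // stand-in for inference work
        }
    }
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![generate])
}
```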

I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made
it use so much memory; even tools like valgrind and heaptrack told me
there wasn't that much memory allocated, yet I could see RES increasing
in `htop`.

Switched to MiMalloc as it seems to slightly decrease memory use.
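
Opting into MiMalloc in a Rust binary is a small change; a minimal sketch
using the mimalloc crate's global-allocator hookup:

```rust
use mimalloc::MiMalloc;

// Route every heap allocation in the binary through mimalloc
// instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```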

Added details about the inference server to README.md, along with an
example Python script that uses it.

I want to use this feature later to investigate how much quantization
or f16/f32 affects the output. It's easier to do such things in Python.
3 years ago
Mikko Juola 3d0afcf243 Make matrix multiplication multithreaded.
This improves performance greatly with f16. It's faster now than OpenCL
on LLaMA-7B.
3 years ago
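The kernel itself isn't shown in this listing, but the general shape of a
row-parallel matmul in Rust can be sketched with rayon (an illustrative
version with made-up names, not the repo's actual implementation):

```rust
use rayon::prelude::*;

// c = a * b, all row-major f32: a is m x k, b is k x n, c is m x n.
// Each output row is independent, so rows parallelize cleanly across threads.
fn matmul_par(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    c.par_chunks_mut(n).enumerate().for_each(|(i, row)| {
        for (j, out) in row.iter_mut().enumerate() {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b[l * n + j];
            }
            *out = acc;
        }
    });
}
```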
Mikko Juola 8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way, but right now it looks like the most memory-friendly one. 3 years ago
Mikko Juola a1970b8a9c Improve transposed matrix multiplication further; this gives around a 10%-20% additional speedup by improving the memory-load-to-instruction ratio. 3 years ago
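For reference, the point of multiplying against a pre-transposed right-hand
side is that both inner-loop reads become sequential, so each memory load
feeds more arithmetic. A hypothetical scalar sketch (not the repo's
optimized kernel):

```rust
// c = a * b where b_t is b stored transposed (n x k, row-major).
// Both a and b_t are read contiguously in the innermost loop, which is
// friendlier to caches and hardware prefetching than striding through b.
fn matmul_transposed(a: &[f32], b_t: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b_t[j * k + l];
            }
            c[i * n + j] = acc;
        }
    }
}
```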
Mikko Juola 8aef5d8831 Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make the _inplace naming consistent. 3 years ago
Mikko Juola 35b0c372a8 Implement some attention operations for OpenCL. 3 years ago
Mikko Juola 63d27dba90 Add partial OpenCL support, it's used in feed forward network only. 3 years ago
Mikko Juola 26d5309cf7 Add support for bigger models.
I've tested with the 13B LLaMA model and it seems to work.

There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no bug; I fixed the bug
and removed some unpickling code.
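
For context on why size-1 tuples are easy to miss: in pickle protocol 2
and later, a one-element tuple gets its own TUPLE1 opcode instead of the
generic MARK ... TUPLE sequence. A hypothetical sketch of the handling
(the Value type and function names are illustrative, not this repo's
unpickler):

```rust
#[derive(Debug)]
enum Value {
    Int(i64),
    Tuple(Vec<Value>),
}

// Pickle protocol 2 opcodes for small tuples (see Python's pickletools).
const TUPLE1: u8 = 0x85;
const TUPLE2: u8 = 0x86;
const TUPLE3: u8 = 0x87;

// Pop the top n stack items and push them back as a single tuple.
fn handle_tuple_opcode(op: u8, stack: &mut Vec<Value>) {
    let n = match op {
        TUPLE1 => 1,
        TUPLE2 => 2,
        TUPLE3 => 3,
        _ => return, // everything else is handled elsewhere
    };
    let items = stack.split_off(stack.len() - n);
    stack.push(Value::Tuple(items));
}

fn main() {
    let mut stack = vec![Value::Int(7)];
    handle_tuple_opcode(TUPLE1, &mut stack);
    println!("{:?}", stack); // [Tuple([Int(7)])]
}
```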

I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
Mikko Juola 18ef805458 Read parameters from the model's JSON file instead of hard-coding them; make the max sequence length configurable. 3 years ago
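Reading those parameters with serde might look like this (a hypothetical
sketch; the field names follow LLaMA's params.json layout, and the struct
and function names are made up):

```rust
use serde::Deserialize;

// Mirrors the fields found in LLaMA's params.json.
#[derive(Debug, Deserialize)]
struct ModelParams {
    dim: usize,
    n_heads: usize,
    n_layers: usize,
    norm_eps: f64,
}

fn load_params(path: &str) -> Result<ModelParams, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&text)?)
}
```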
Mikko Juola f103871bc0 Make the output colored. This is essential for being taken seriously.
Also made some changes to keep clippy happy.
3 years ago
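One common way to get colored terminal output with no dependencies is raw
ANSI escape codes; a minimal sketch (not necessarily how this commit does it):

```rust
// \x1b[32m switches the terminal to green, \x1b[0m resets it.
fn main() {
    println!("\x1b[32m{}\x1b[0m", "generated text, now in green");
}
```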
Mikko Juola d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file. 3 years ago
Mikko Juola f6217e0036 Add readme, make clippy happy. 3 years ago
Mikko Juola 3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago