rllama/src
Mikko Juola b9be485610 Add simple HTTP API support.
It took an annoying amount of effort just to make this simple server.

I tried the rouille web framework first, but it didn't support sending
chunked output to the client line-by-line. (It seems that if it exposed
more details about the underlying tiny-http package, I could have hacked
it to work.)
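
For reference, "chunked output line-by-line" corresponds on the wire to
HTTP/1.1 chunked transfer encoding: each chunk is its length in hex, CRLF,
the payload, CRLF, and a zero-length chunk ends the body. The following is
a minimal sketch using only the Rust standard library. It is not the
rouille, tiny-http, or Rocket code from this commit, and the function names
`write_chunk`, `finish_chunks`, and `stream_lines` are hypothetical.

```rust
use std::io::{self, Write};

/// Write one chunk in HTTP/1.1 chunked transfer encoding:
/// hex length, CRLF, payload, CRLF. Flushes so the client sees it at once.
fn write_chunk<W: Write>(out: &mut W, data: &[u8]) -> io::Result<()> {
    if data.is_empty() {
        // A zero-length chunk would terminate the body, so skip empty payloads.
        return Ok(());
    }
    write!(out, "{:x}\r\n", data.len())?;
    out.write_all(data)?;
    out.write_all(b"\r\n")?;
    out.flush()
}

/// Write the terminating zero-length chunk (no trailers).
fn finish_chunks<W: Write>(out: &mut W) -> io::Result<()> {
    out.write_all(b"0\r\n\r\n")?;
    out.flush()
}

/// Hypothetical helper: send each line as its own chunk, then end the body.
fn stream_lines<W: Write>(out: &mut W, lines: &[&str]) -> io::Result<()> {
    for line in lines {
        // Re-attach the newline so the client can split on lines.
        let mut payload = line.as_bytes().to_vec();
        payload.push(b'\n');
        write_chunk(out, &payload)?;
    }
    finish_chunks(out)
}
```

Any `Write` implementor works as the sink, for example a `Vec<u8>` in
tests or a `TcpStream` once the response headers (including
`Transfer-Encoding: chunked`) have been written.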

I went with Rocket because it had less async stuff and seemed decent.

I got weird issues where memory use seemed to keep increasing. I may
have fixed that, but I couldn't figure out what made it use so much
memory. Even tools like valgrind and heaptrack told me there isn't that
much memory allocated, yet I can see RES increasing in `htop`.

Switched to MiMalloc as it seems to slightly decrease memory use.

Added details about the inference server to README.md. Also added an
example Python script that uses it.

I want to use this feature later to investigate how much quantization
or f16/f32 affects output. It is easier to do such things in Python.
3 years ago
benches Make matrix multiplication multithreaded. 3 years ago
protomodels First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago
embedding.rs Add support for bigger models. 3 years ago
lib.rs Add simple HTTP API support. 3 years ago
main.rs Add simple HTTP API support. 3 years ago
rllama_main.rs Add simple HTTP API support. 3 years ago
simd_support.rs Refactor all SIMD to one file, simd_support.rs 3 years ago
tensor.rs Add simple HTTP API support. 3 years ago
tensor_opencl_support.rs Some code cleanup in OpenCL. 3 years ago
token_sampler.rs Add simple HTTP API support. 3 years ago
tokenizer.rs Improve matrix multiplication transposed further, this gives around ~10%-20% further increase by improving memory load to instruction ratio. 3 years ago
transformer.rs Add simple HTTP API support. 3 years ago
unpickler.rs Improve matrix multiplication transposed further, this gives around ~10%-20% further increase by improving memory load to instruction ratio. 3 years ago