It took an annoying amount of effort just to make this simple server. I tried the rouille web framework first, but it didn't support sending chunked output to the client line by line. (It seems that if it exposed more details about the underlying tiny-http package, I could have hacked it to work.) I went with Rocket because it had less async stuff and seemed decent.

I got weird issues where memory use seemed to keep increasing. I may have fixed that, but I couldn't figure out what made it use so much memory: even tools like valgrind and heaptrack told me there isn't that much memory allocated, yet I can see RES increasing in `htop`. Switched to MiMalloc as it seems to slightly decrease memory use.

Added details about the inference server to README.md, and also added an example Python script that uses it. I want to use this feature later to investigate how much quantization or f16/f32 affects the output. Such things are easier to do in Python.

Latest commit: 3 years ago

| Name | Last updated | |
|---|---|---|
| benches | 3 years ago | |
| protomodels | 3 years ago | |
| embedding.rs | 3 years ago | |
| lib.rs | 3 years ago | |
| main.rs | 3 years ago | |
| rllama_main.rs | 3 years ago | |
| simd_support.rs | 3 years ago | |
| tensor.rs | 3 years ago | |
| tensor_opencl_support.rs | 3 years ago | |
| token_sampler.rs | 3 years ago | |
| tokenizer.rs | 3 years ago | |
| transformer.rs | 3 years ago | |
| unpickler.rs | 3 years ago | |