29 Commits (8cc82ae7e20256f432d0e0c16c9fe384dedafa8b)

Author SHA1 Message Date
Mikko Juola 8cc82ae7e2 Make separate matrix_vector_muls for 4-bit quantization rather than using matrix_mul for them. 3 years ago
Mikko Juola 2f3e9bc0f5 K4-bit inference works now. Performance isn't as good as I'd like it to be, though. 3 years ago
Mikko Juola 40121e1c82 Multithread the k4 * f32 matrix multiplication. 3 years ago
Mikko Juola b8946da2d8 Implement matrix multiplication for 4-bit * 32-bit floats.
As of this commit, the test passes. But I want to optimize this a bit, to see
if improving the load-instruction-to-arithmetic-instruction ratio will make
single-threaded performance a bit speedier.
3 years ago
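The commit doesn't spell out the 4-bit layout, so purely as illustration: the inner loop of a 4-bit × f32 product ends up unpacking two nibbles per byte and dequantizing with per-block constants, along these lines (scalar Rust sketch; names and block format are assumptions, not taken from the repo):

```rust
/// Scalar sketch of a 4-bit x f32 dot product. The layout (per-block
/// `scale`/`min`, two quantized values packed per byte) is an assumption
/// for illustration, not the repo's actual format.
fn dot_q4_f32(nibbles: &[u8], scale: f32, min: f32, rhs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (i, &byte) in nibbles.iter().enumerate() {
        // Low nibble is the even element, high nibble the odd one.
        let lo = (byte & 0x0f) as f32 * scale + min;
        let hi = (byte >> 4) as f32 * scale + min;
        acc += lo * rhs[2 * i] + hi * rhs[2 * i + 1];
    }
    acc
}
```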
Mikko Juola f6249e8d9f Add skeleton code for 4-bit quantization.
The type is now recognized and I have a very simple quantizer too, but no
operations are implemented yet.
3 years ago
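The commit only mentions "a very simple quantizer", so the exact scheme is unknown; a minimal sketch of one plausible affine 4-bit scheme (per-block scale and minimum, two values per byte, all names assumed) looks like this:

```rust
/// One block of 32 quantized values. This layout is an assumption for
/// illustration; the commit doesn't describe the actual format.
struct Q4Block {
    scale: f32,
    min: f32,
    nibbles: [u8; 16], // 32 values, two 4-bit indices per byte
}

fn quantize_block(values: &[f32; 32]) -> Q4Block {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Map [min, max] onto the 16 representable levels.
    let scale = (max - min) / 15.0;
    let inv = if scale != 0.0 { 1.0 / scale } else { 0.0 };
    let mut nibbles = [0u8; 16];
    for (i, pair) in values.chunks_exact(2).enumerate() {
        let q0 = ((pair[0] - min) * inv).round() as u8 & 0x0f;
        let q1 = ((pair[1] - min) * inv).round() as u8 & 0x0f;
        nibbles[i] = q0 | (q1 << 4);
    }
    Q4Block { scale, min, nibbles }
}
```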
Mikko Juola 5e241722cb Fix compilation when the opencl feature is being used. 3 years ago
Mikko Juola b9be485610 Add simple HTTP API support.
It took an annoying amount of effort just to make this simple server.

I tried the rouille web framework first, but it didn't support streaming
chunked output to the client line by line. (It seems that if it had
exposed more details of the underlying tiny-http package, I could have
hacked it to work.)

I went with Rocket because it had less async stuff and seemed decent.

I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made it
use so much memory; even tools like valgrind and heaptrack told me there
isn't that much memory allocated, yet I can see RES increasing in `htop`.

Switched to MiMalloc as it seems to slightly decrease memory use.

Added details about the inference server to README.md, along with an
example Python script for it.

I want to use this feature later to investigate how much quantization or
f16/f32 affects output. Such things are easier to do in Python.
3 years ago
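The MiMalloc switch itself is small; per the mimalloc crate's documented usage, replacing the global allocator is just:

```rust
use mimalloc::MiMalloc;

// Route all heap allocations through MiMalloc instead of the system
// allocator (requires the `mimalloc` crate as a dependency).
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```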
Mikko Juola 9c86c17318 Refactor all SIMD to one file, simd_support.rs
This should make it a bit easier to port to other SIMD instruction sets,
since the SIMD instructions are no longer littered randomly around the
tensor.rs file.
3 years ago
Mikko Juola ff349eeea0 Make the number of threads configurable and obtain it from the system by default rather than hardcoding it to 32. 3 years ago
Mikko Juola 3d0afcf243 Make matrix multiplication multithreaded.
This improves performance greatly with f16. It's faster now than OpenCL
on LLaMA-7B.
3 years ago
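A minimal sketch of the general approach, partitioning output rows across scoped threads; the function name, signature, and row-major layout are assumptions for illustration, not the repo's API:

```rust
use std::thread;

/// Multiply an m x k matrix `a` by a k x n matrix `b` into `out`,
/// splitting the output rows across `nthreads` scoped threads.
/// Assumes row-major layouts and m, n, k, nthreads > 0.
fn matmul_threaded(out: &mut [f32], a: &[f32], b: &[f32],
                   m: usize, k: usize, n: usize, nthreads: usize) {
    let rows_per = (m + nthreads - 1) / nthreads;
    thread::scope(|s| {
        for (t, chunk) in out.chunks_mut(rows_per * n).enumerate() {
            let a = &a[t * rows_per * k..]; // this thread's first input row
            s.spawn(move || {
                for (r, row_out) in chunk.chunks_mut(n).enumerate() {
                    for j in 0..n {
                        let mut acc = 0.0f32;
                        for x in 0..k {
                            acc += a[r * k + x] * b[x * n + j];
                        }
                        row_out[j] = acc;
                    }
                }
            });
        }
    });
}
```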
Mikko Juola 8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way, but right now it looks like the most memory-friendly one. 3 years ago
Mikko Juola 1f5e687298 Modest improvement to f16 matrix_vector_mul_transposed without OpenCL.
It's still significantly slower than the f32 version.
3 years ago
Mikko Juola acfd6bd5bd Add f16, non-OpenCL version of matrix_vector_mul_transposed as well.
This seems to be 100% slower than the pure f32 version in the benchmark.
I'm not sure why as of this commit, but I'll investigate further.
3 years ago
Mikko Juola baecd25ee3 Add f16 version of matrix multiplication that works without any OpenCL.
In the benchmark it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet, though, and I need to implement some
other ops for that to start working.
3 years ago
Mikko Juola a1970b8a9c Improve transposed matrix multiplication further; this gives roughly a further 10%-20% increase by improving the memory-load-to-instruction ratio. 3 years ago
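The pattern behind this and the two AVX2 commits below is the classic one: keep several independent accumulators in flight so FMA latency is hidden and each loaded register does useful arithmetic. A sketch of an unrolled dot product, not the repo's exact kernel:

```rust
use std::arch::x86_64::*;

/// Unrolled AVX2/FMA dot product with four independent accumulators.
/// Illustrative sketch, not the repo's kernel.
///
/// Safety: the CPU must support AVX2 and FMA, `a.len() == b.len()`,
/// and the length must be a multiple of 32.
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    let mut s0 = _mm256_setzero_ps();
    let mut s1 = _mm256_setzero_ps();
    let mut s2 = _mm256_setzero_ps();
    let mut s3 = _mm256_setzero_ps();
    let mut i = 0;
    while i + 32 <= a.len() {
        // Four loads + four FMAs per iteration; the accumulators don't
        // depend on each other, so the FMAs can overlap in the pipeline.
        s0 = _mm256_fmadd_ps(_mm256_loadu_ps(a.as_ptr().add(i)),
                             _mm256_loadu_ps(b.as_ptr().add(i)), s0);
        s1 = _mm256_fmadd_ps(_mm256_loadu_ps(a.as_ptr().add(i + 8)),
                             _mm256_loadu_ps(b.as_ptr().add(i + 8)), s1);
        s2 = _mm256_fmadd_ps(_mm256_loadu_ps(a.as_ptr().add(i + 16)),
                             _mm256_loadu_ps(b.as_ptr().add(i + 16)), s2);
        s3 = _mm256_fmadd_ps(_mm256_loadu_ps(a.as_ptr().add(i + 24)),
                             _mm256_loadu_ps(b.as_ptr().add(i + 24)), s3);
        i += 32;
    }
    // Reduce the four accumulators to a single scalar.
    let s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), s);
    buf.iter().sum()
}
```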
Mikko Juola 61bc42b728 Improve the handwritten AVX2 for matrix_mul_inplace_transposed.
This is something like ~60% faster than the old version.
3 years ago
Mikko Juola 0cce655763 Unroll the handwritten AVX2 matrix_vector_mul_transposed slightly; this gives a ~20% boost to that operation.
Modest improvement in overall performance for text generation.
3 years ago
Mikko Juola 8aef5d8831 Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make _inplace usage consistent. 3 years ago
Mikko Juola 8c64313fec Rewrite the matrix multiplication.
This is something like ~10 times faster than the old one. But
surprisingly this didn't have much impact on text generation time. Maybe
most of the remaining slowness no longer comes from matrix multiplication.

This also slowed down the CPU implementation. I think I'll try adding
another kernel for CPU OpenCL later.
3 years ago
Mikko Juola 63d27dba90 Add partial OpenCL support; it's used in the feed-forward network only. 3 years ago
Mikko Juola 1a88482988 Add some OpenCL bits.
I wrote an OpenCL matrix_mul_inplace_transposed. It is much faster than
my CPU implementation when run on the GPU, and also quite a lot faster
than my own implementation when OpenCL runs on the CPU (OpenCL targets
both).

Basically it beats all of my crappy code, so I think I will be replacing
some of my other operations with it in the near future.
3 years ago
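For flavor, a transposed matmul kernel in OpenCL is short; in this sketch (kernel name and layout are illustrative, not from the repo) each work-item computes one output element as a dot product of a row of `a` with a row of `b`:

```rust
// Illustrative OpenCL C source, embedded as a Rust string; not the
// repo's actual kernel. `b` is stored transposed, so its rows
// correspond to columns of the logical right-hand matrix.
const MATMUL_TRANSPOSED_SRC: &str = r#"
__kernel void matmul_transposed(__global const float *a,
                                __global const float *b,
                                __global float *out,
                                const int k, const int n) {
    int row = get_global_id(0);
    int col = get_global_id(1);
    float acc = 0.0f;
    for (int i = 0; i < k; i++) {
        acc += a[row * k + i] * b[col * k + i];
    }
    out[row * n + col] = acc;
}
"#;
```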
Mikko Juola a92017bf56 Add some initial OpenCL stuff.
I can copy tensors to the GPU and back, but not much more. Maybe next
time I'll try implementing matrix_mul_transposed or something on the GPU.
3 years ago
Mikko Juola 53d367e6fa Add some beginnings of OpenCL implementation.
I think I'll try to get the smaller modules to run faster.
3 years ago
Mikko Juola 846759b277 Optimize conversions between f16 and f32.
x86 cannot do f16 arithmetic natively, but it does have an instruction
to convert them to f32. I optimized those conversions to use SIMD
instructions.
3 years ago
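The conversion instruction in question is F16C's VCVTPH2PS (with VCVTPS2PH going the other way), which handles eight values at a time. A minimal sketch of the f16-to-f32 direction, assuming raw IEEE half-precision bit patterns in a `u16` buffer:

```rust
use std::arch::x86_64::*;

/// Convert half-precision bit patterns to f32, eight lanes per step.
/// Illustrative sketch; assumes both lengths are equal multiples of 8.
///
/// Safety: the CPU must support the `f16c` feature.
#[target_feature(enable = "f16c")]
unsafe fn f16_to_f32(src: &[u16], dst: &mut [f32]) {
    for (s, d) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        let halves = _mm_loadu_si128(s.as_ptr() as *const __m128i);
        let floats = _mm256_cvtph_ps(halves); // VCVTPH2PS
        _mm256_storeu_ps(d.as_mut_ptr(), floats);
    }
}
```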
Mikko Juola 26d5309cf7 Add support for bigger models.
I've tested with the 13B LLaMA model and it seems to work.

There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no bug; after fixing it,
I removed some unpickling code.

I added functions to tensor.rs to be able to construct tensors out of
multiple files.
3 years ago
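For context on the tuple bug: pickle protocol 2 has dedicated opcodes for small tuples, and TUPLE1 (0x85) is easy to overlook. A sketch of how an unpickler handles them (illustrative `Value` type and stack, not the repo's actual code, and not necessarily the actual cause of the bug):

```rust
/// Illustrative pickle value type; a real unpickler has many more variants.
enum Value {
    Tuple(Vec<Value>),
}

/// Handle the fixed-size tuple opcodes TUPLE1 (0x85), TUPLE2 (0x86) and
/// TUPLE3 (0x87): pop 1, 2 or 3 values and push them back as one tuple.
/// Assumes the stack holds at least that many values.
fn handle_opcode(op: u8, stack: &mut Vec<Value>) {
    match op {
        0x85..=0x87 => {
            let n = (op - 0x85 + 1) as usize;
            let items = stack.split_off(stack.len() - n);
            stack.push(Value::Tuple(items));
        }
        _ => { /* other opcodes elided */ }
    }
}
```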
Mikko Juola f103871bc0 Make the output colored. This is essential to be taken seriously.
Also did some changes to make clippy happy.
3 years ago
Mikko Juola d7a3f57510 Update README.md, add multithreading and optimizations to some operations, and allow loading the prompt from a file. 3 years ago
Mikko Juola f6217e0036 Add readme, make clippy happy. 3 years ago
Mikko Juola 3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. 3 years ago