In benchmarks it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet, though, and I need to implement some
other ops before that starts working.
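For reference, here is a minimal sketch of the kind of f16 op involved, assuming the `half` crate for the f16 type (illustrative only, not the code in this repository): values are stored as f16 to halve memory traffic, while the accumulation is done in f32 to keep precision.

```rust
use half::f16;

// Dot product over f16 storage with an f32 accumulator.
fn dot_f16(a: &[f16], b: &[f16]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| x.to_f32() * y.to_f32())
        .sum()
}

fn main() {
    let a: Vec<f16> = [1.0f32, 2.0, 3.0].iter().map(|&v| f16::from_f32(v)).collect();
    let b: Vec<f16> = [4.0f32, 5.0, 6.0].iter().map(|&v| f16::from_f32(v)).collect();
    println!("{}", dot_f16(&a, &b)); // 32
}
```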
This is something like 10 times faster than the old one. Surprisingly,
it didn't have much impact on text generation time; maybe most of the
remaining slowness no longer comes from matrix multiplication. It also
slowed down the CPU implementation, so I think I'll try adding another
kernel later for OpenCL running on the CPU.
I wrote an OpenCL matrix_mul_inplace_transposed. On the GPU it is much
faster than my CPU implementation, and even OpenCL running on the CPU
is quite a lot faster than my own code (OpenCL can target both CPU and
GPU). Basically it destroys all of my crappy code, so I think I will be
replacing some of my other operations with OpenCL in the near future.
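For anyone unfamiliar with the transposed variant, here is a naive Rust reference for what such an operation computes (the names, shapes, and in-place convention are illustrative, not the actual signature in this repository): with the second matrix stored pre-transposed, the inner loop reads both operands row by row, which is far friendlier to caches and GPU memory access than striding down columns.

```rust
// out = a * b^T, all matrices row-major:
// a is m x k, b is n x k (the transpose of a k x n matrix), out is m x n.
fn matmul_transposed(a: &[f32], b: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Both a and b are read contiguously here.
                acc += a[i * k + p] * b[j * k + p];
            }
            out[i * n + j] = acc;
        }
    }
}

fn main() {
    // (2x3) times (2x3)^T -> 2x2
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0f32];
    let b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0f32];
    let mut out = [0.0f32; 4];
    matmul_transposed(&a, &b, &mut out, 2, 2, 3);
    println!("{:?}", out); // [4.0, 2.0, 10.0, 5.0]
}
```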
I've tested with the 13B LLaMA model and it seems to work.
There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no such bug; I fixed the
bug and removed some of that unpickling code.
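To make the tuple-of-size-1 case concrete: pickle protocol 2 has a dedicated TUPLE1 opcode (0x85) that pops one value off the stack and pushes a one-element tuple. The sketch below shows the expected handling; if the opcode is instead treated as a no-op, every size-1 tuple silently vanishes from the decoded structure. The Value type and function name are made up for illustration and are not the actual unpickler code.

```rust
#[derive(Debug)]
enum Value {
    Int(i64),
    Tuple(Vec<Value>),
    // ...other pickle value kinds elided
}

// TUPLE1 must pop exactly one value and push a one-element tuple.
fn handle_tuple1(stack: &mut Vec<Value>) -> Result<(), String> {
    let item = stack.pop().ok_or("TUPLE1 on an empty stack")?;
    stack.push(Value::Tuple(vec![item]));
    Ok(())
}

fn main() {
    let mut stack = vec![Value::Int(42)];
    handle_tuple1(&mut stack).unwrap();
    println!("{:?}", stack); // [Tuple([Int(42)])]
}
```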
I added functions to tensor.rs to be able to construct tensors out of
multiple files.
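The idea, very roughly: the larger LLaMA checkpoints are split across several files, so one logical weight matrix has to be stitched together from per-file pieces. The sketch below is a hypothetical illustration for shards split along the row dimension; the function name and layout are mine, not the tensor.rs API, and some weights are split along the other dimension instead.

```rust
// Concatenate row blocks of a row-major matrix, in shard (file) order.
fn concat_row_shards(shards: &[Vec<f32>], cols: usize) -> Vec<f32> {
    let mut out = Vec::new();
    for shard in shards {
        assert!(shard.len() % cols == 0, "shard is not a whole number of rows");
        out.extend_from_slice(shard);
    }
    out
}

fn main() {
    // Two shards of a 4 x 2 matrix, two rows each.
    let shards = vec![vec![1.0, 2.0, 3.0, 4.0], vec![5.0, 6.0, 7.0, 8.0]];
    let full = concat_row_shards(&shards, 2);
    assert_eq!(full, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
}
```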