In benchmarks it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet, though, and I need to implement some
other ops before that starts working.
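For reference, here is a minimal sketch of the kind of f16 op involved, assuming the `half` crate for the f16 type (illustrative only, not the code in this repository): values are stored as f16 to halve memory traffic, while the accumulation is done in f32 to keep precision.

```rust
use half::f16;

// Dot product over f16 storage with an f32 accumulator.
fn dot_f16(a: &[f16], b: &[f16]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| x.to_f32() * y.to_f32())
        .sum()
}

fn main() {
    let a: Vec<f16> = [1.0f32, 2.0, 3.0].iter().map(|&v| f16::from_f32(v)).collect();
    let b: Vec<f16> = [4.0f32, 5.0, 6.0].iter().map(|&v| f16::from_f32(v)).collect();
    println!("{}", dot_f16(&a, &b)); // 32
}
```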
This is something like 10 times faster than the old one. Surprisingly,
it didn't have much impact on text generation time; maybe most of the
remaining slowness no longer comes from matrix multiplication. It also
slowed down the CPU implementation, so I think I'll try adding another
kernel later for OpenCL running on the CPU.
I wrote an OpenCL matrix_mul_inplace_transposed. On the GPU it is much
faster than my CPU implementation, and even OpenCL running on the CPU
is quite a lot faster than my own code (OpenCL can target both CPU and
GPU). Basically it destroys all of my crappy code, so I think I will be
replacing some of my other operations with OpenCL in the near future.
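For anyone unfamiliar with the transposed variant, here is a naive Rust reference for what such an operation computes (the names, shapes, and in-place convention are illustrative, not the actual signature in this repository): with the second matrix stored pre-transposed, the inner loop reads both operands row by row, which is far friendlier to caches and GPU memory access than striding down columns.

```rust
// out = a * b^T, all matrices row-major:
// a is m x k, b is n x k (the transpose of a k x n matrix), out is m x n.
fn matmul_transposed(a: &[f32], b: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Both a and b are read contiguously here.
                acc += a[i * k + p] * b[j * k + p];
            }
            out[i * n + j] = acc;
        }
    }
}

fn main() {
    // (2x3) times (2x3)^T -> 2x2
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0f32];
    let b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0f32];
    let mut out = [0.0f32; 4];
    matmul_transposed(&a, &b, &mut out, 2, 2, 3);
    println!("{:?}", out); // [4.0, 2.0, 10.0, 5.0]
}
```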
I've tested with the 13B LLaMA model and it seems to work.
There was a bug in the unpickler that skipped over tuples of size 1. I
had written a bunch of code assuming there was no such bug; I fixed the
bug and removed some of that unpickling code.
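To make the tuple-of-size-1 case concrete: pickle protocol 2 has a dedicated TUPLE1 opcode (0x85) that pops one value off the stack and pushes a one-element tuple. The sketch below shows the expected handling; if the opcode is instead treated as a no-op, every size-1 tuple silently vanishes from the decoded structure. The Value type and function name are made up for illustration and are not the actual unpickler code.

```rust
#[derive(Debug)]
enum Value {
    Int(i64),
    Tuple(Vec<Value>),
    // ...other pickle value kinds elided
}

// TUPLE1 must pop exactly one value and push a one-element tuple.
fn handle_tuple1(stack: &mut Vec<Value>) -> Result<(), String> {
    let item = stack.pop().ok_or("TUPLE1 on an empty stack")?;
    stack.push(Value::Tuple(vec![item]));
    Ok(())
}

fn main() {
    let mut stack = vec![Value::Int(42)];
    handle_tuple1(&mut stack).unwrap();
    println!("{:?}", stack); // [Tuple([Int(42)])]
}
```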
I added functions to tensor.rs to be able to construct tensors out of
multiple files.
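The idea, very roughly: the larger LLaMA checkpoints are split across several files, so one logical weight matrix has to be stitched together from per-file pieces. The sketch below is a hypothetical illustration for shards split along the row dimension; the function name and layout are mine, not the tensor.rs API, and some weights are split along the other dimension instead.

```rust
// Concatenate row blocks of a row-major matrix, in shard (file) order.
fn concat_row_shards(shards: &[Vec<f32>], cols: usize) -> Vec<f32> {
    let mut out = Vec::new();
    for shard in shards {
        assert!(shard.len() % cols == 0, "shard is not a whole number of rows");
        out.extend_from_slice(shard);
    }
    out
}

fn main() {
    // Two shards of a 4 x 2 matrix, two rows each.
    let shards = vec![vec![1.0, 2.0, 3.0, 4.0], vec![5.0, 6.0, 7.0, 8.0]];
    let full = concat_row_shards(&shards, 2);
    assert_eq!(full, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
}
```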