This is roughly 10 times faster than the old one. Surprisingly, though,
it didn't have much impact on text generation time. Maybe most of the
remaining slowness no longer comes from matrix multiplication.
This also slowed down the CPU implementation. I think I'll try adding
another kernel for CPU OpenCL later.
I wrote an OpenCL matrix_mul_inplace_transposed. On GPU it is much
faster than my CPU implementation, and even on CPU (OpenCL runs on both
CPU and GPU) it is quite a lot faster than my own implementation.
Basically it beats all of my crappy code. So I think I will be
replacing some of my other operations with this stuff in the near
future.
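
For reference, here is a rough sketch of what an operation like
matrix_mul_inplace_transposed might compute. This is not the OpenCL
kernel, only a plain Python illustration. It assumes that "transposed"
means the right-hand matrix is stored already transposed (N x K), and
that "inplace" means the result is written into a preallocated output.
The argument names and their order are made up for the example.

```python
def matrix_mul_inplace_transposed(out, a, b_t):
    """Overwrite `out` with a @ transpose(b_t).

    Assumed layout (hypothetical, not taken from the real code):
      a    is M x K
      b_t  is N x K, i.e. the right-hand operand stored transposed
      out  is M x N and must already be allocated
    """
    m, k, n = len(a), len(a[0]), len(b_t)
    if any(len(row) != k for row in b_t):
        raise ValueError("inner dimensions do not match")
    if len(out) != m or any(len(row) != n for row in out):
        raise ValueError("output has the wrong shape")
    for i in range(m):
        a_row = a[i]
        for j in range(n):
            # Both operands are walked along a row, so the memory
            # access is sequential for each of them.
            out[i][j] = sum(x * y for x, y in zip(a_row, b_t[j]))


a = [[1.0, 2.0], [3.0, 4.0]]
b_t = [[5.0, 6.0], [7.0, 8.0]]
out = [[0.0, 0.0], [0.0, 0.0]]
matrix_mul_inplace_transposed(out, a, b_t)
print(out)  # [[17.0, 23.0], [39.0, 53.0]]
```

With the right-hand side stored transposed, each output element is a
dot product of two rows, which is probably part of why this form is
convenient to hand to a kernel.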