7 Commits (8aef5d8831bf57e3ef11b964a9be108a3573de7b)

Author SHA1 Message Date
Mikko Juola 8aef5d8831 Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make the use of the _inplace suffix consistent. 3 years ago
Mikko Juola 1c5ec04217 Add a different kernel to be used when OpenCL device is a CPU.
This is almost the same code I had before. It runs better on CPUs than
on GPUs.
3 years ago
Mikko Juola 8c64313fec Rewrite the matrix multiplication.
This is roughly 10 times faster than the old one. But surprisingly,
this didn't have much impact on text generation time. Maybe most of
the remaining slowness no longer comes from matrix multiplication.

This also slowed down the CPU implementation. I think I'll try adding
another kernel later for CPU OpenCL.
3 years ago
Mikko Juola 63d27dba90 Add partial OpenCL support; it's used in the feed-forward network only. 3 years ago
Mikko Juola 1a88482988 Add some OpenCL bits.
I wrote an OpenCL matrix_mul_inplace_transposed. On a GPU it is much
faster than my CPU implementation, and even on a CPU (OpenCL runs on
both CPU and GPU) it is quite a lot faster than my own implementation.

Basically it can destroy all of my crappy code. So I think I will be
replacing some of my other operations with this stuff in the near future.
3 years ago
Mikko Juola a92017bf56 Add some initial OpenCL stuff.
I can copy tensors to GPU and back but not much more. Maybe next time
I'll try implementing matrix_mul_transposed or something on the GPU.
3 years ago
Mikko Juola 53d367e6fa Add some beginnings of OpenCL implementation.
I think I'll try to get the smaller modules to run faster.
3 years ago