3 Commits (40121e1c82e00a3f9567c7af66d632b3d41fca22)

Author SHA1 Message Date
Mikko Juola 40121e1c82 Multithread the k4 * f32 matrix multiplication. 3 years ago
Mikko Juola b8946da2d8 Implement matrix multiplication for 4-bit * 32-bit floats.
As of this commit, the test works. I still want to optimize this a bit, to
see whether increasing the ratio of load instructions to arithmetic
instructions makes single-threaded performance a bit faster.
3 years ago
Mikko Juola 9c86c17318 Refactor all SIMD to one file, simd_support.rs
This should make it a bit easier to port to other SIMD instruction sets,
since the SIMD instructions are no longer scattered around the
tensor.rs file.
3 years ago