@@ -143,9 +143,20 @@ LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1226ms / token

# commit de5dd592777b3a4f5a9e8c93c8aeef25b9294364 (15 March 2023)
# The matrix multiplication on GPU is now much faster. It didn't have that much
# effect overall, but I got a modest improvement on LLaMA-7B on GPU.

LLaMA-7B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti: 247ms / token
LLaMA-7B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 680ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti: <ran out of GPU memory>
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token
LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token

# commit 3d0afcf24309f28ec540ed7645c35400a865ad6f
# I've been focusing on making the ordinary non-OpenCL CPU implementation
# faster and I got some gains, most importantly from multithreading.
# There is Float16 support now, so I've added f16/f32 to these tables:

LLaMA-7B: AMD Ryzen 3950X: 552ms / token f16
LLaMA-7B: AMD Ryzen 3950X: 1008ms / token f32
LLaMA-13B: AMD Ryzen 3950X: 1029ms / token f16
LLaMA-13B: AMD Ryzen 3950X: 1930ms / token f32
```
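The notes above mention multithreading and Float16 support only in passing. As a rough illustration of the general idea, and not this project's actual implementation, the sketch below shows a row-parallel matrix-vector product where weights are stored as f16 but accumulation is done in f32. It assumes the `rayon` and `half` crates, and the function name `matvec_f16` is made up for the example.

```rust
// Illustrative sketch only: multithreaded matvec over f16 weights,
// mirroring the "multithreading + Float16 storage" notes above.
use half::f16;
use rayon::prelude::*;

/// y = W * x, where W is a `rows x cols` row-major matrix in f16
/// and x is an f32 vector. Accumulation happens in f32 for accuracy.
fn matvec_f16(w: &[f16], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    // Each output row is an independent dot product, so rows can be
    // distributed across worker threads with a parallel iterator.
    (0..rows)
        .into_par_iter()
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            row.iter()
                .zip(x)
                .map(|(wv, xv)| wv.to_f32() * xv)
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    // Tiny smoke test: a 2x3 matrix times a length-3 vector.
    let w: Vec<f16> = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]
        .iter()
        .map(|&v| f16::from_f32(v))
        .collect();
    let x = vec![1.0f32, 0.5, 2.0];
    println!("{:?}", matvec_f16(&w, &x, 2, 3)); // approximately [8.0, 18.5]
}
```

Splitting work by output row is a natural choice here because the rows need no synchronization, which is typically where a multithreaded CPU implementation of transformer matrix multiplies gets most of its gain.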