@@ -4,10 +4,9 @@ RLLaMA is a pure Rust implementation of [LLaMA large language model inference.](
 
 ## Supported features
 
-* Use either `f16` and `f32` weights.
-* LLaMA-7B, LLaMA-13B and LLaMA-30B are all confirmed working. LLaMA-65B
-  likely works but I haven't found a big enough computer to run it.
-* Multithreaded hand-optimized CPU inference
+* Uses either `f16` and `f32` weights.
+* LLaMA-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B all confirmed working
+* Hand-optimized AVX2 implementation
 * OpenCL support for GPU inference.
 
 ## Performance
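
The "Hand-optimized AVX2 implementation" bullet added above refers to SIMD kernels for the dot products that dominate transformer inference. A minimal sketch of that technique, illustrative only and not rllama's actual code (the function name `dot_f32_avx2` and the data layout are assumptions):

```rust
// Sketch of an AVX2 dot-product kernel. Illustrative only -- not rllama's
// actual code; the function name and layout are assumptions. x86-64 only.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Dot product of two equal-length `f32` slices, 8 lanes at a time.
///
/// Safety: the caller must verify AVX2 and FMA support first,
/// e.g. with `is_x86_feature_detected!`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_f32_avx2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, fused
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 8.
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let a = vec![1.0f32; 100];
    let b = vec![0.5f32; 100];
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // 100 * (1.0 * 0.5) = 50.0
        println!("dot = {}", unsafe { dot_f32_avx2(&a, &b) });
    }
}
```

For `f16` weights the inner loop keeps the same shape, except each load goes through a half-to-single conversion such as `_mm256_cvtph_ps` (F16C).
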
@@ -22,6 +21,7 @@ LLaMA-7B: AMD Ryzen 3950X: 1008ms / token f32 (pure
 LLaMA-13B: AMD Ryzen 3950X: 1029ms / token f16 (pure Rust)
 LLaMA-13B: AMD Ryzen 3950X: 1930ms / token f32 (pure Rust)
 LLaMA-30B: AMD Ryzen 5950X: 2112ms / token f16 (pure Rust)
+LLaMA-65B: AMD Ryzen 5950X: 4186ms / token f16 (pure Rust)
 
 OpenCL (all use f16):
 
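
For scale, these figures convert directly to throughput: 1029 ms/token is about 1000 / 1029 ≈ 0.97 tokens per second, while the f32 path at 1930 ms/token is ≈ 0.52 tokens per second, so on this CPU the `f16` weights roughly halve per-token latency.
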
@@ -181,10 +181,13 @@ LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token
 # I've been focusing on making the ordinary non-OpenCL CPU implementation
 # faster and I got some gains, most importantly from multithreading.
 # There is Float16 support now, so I've added f16/f32 to these tables:
+#
+# I also managed to run LLaMA-65B for the first time.
 
 LLaMA-7B: AMD Ryzen 3950X: 552ms / token f16
 LLaMA-7B: AMD Ryzen 3950X: 1008ms / token f32
 LLaMA-13B: AMD Ryzen 3950X: 1029ms / token f16
 LLaMA-13B: AMD Ryzen 3950X: 1930ms / token f32
 LLaMA-30B: AMD Ryzen 5950X: 2112ms / token f16
+LLaMA-65B: AMD Ryzen 5950X: 4186ms / token f16
 ```
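
The changelog above credits multithreading for most of the CPU-side gains. A minimal sketch of the usual shape of that optimization, assuming row-major weights and one contiguous band of output rows per thread (illustrative only, not rllama's implementation; `matvec_parallel` is a made-up name):

```rust
// Sketch of a row-parallel matrix-vector multiply. Illustrative only --
// not rllama's implementation; `matvec_parallel` and the layout are
// assumptions.
use std::thread;

/// Computes y = W * x for a row-major `rows` x `cols` matrix, giving each
/// thread a disjoint, contiguous band of output rows.
fn matvec_parallel(w: &[f32], rows: usize, cols: usize, x: &[f32], threads: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    assert!(threads > 0);
    let band = (rows + threads - 1) / threads; // ceil(rows / threads)
    let mut y = vec![0.0f32; rows];
    thread::scope(|s| {
        for (i, y_band) in y.chunks_mut(band).enumerate() {
            let w_band = &w[i * band * cols..];
            s.spawn(move || {
                for (r, out) in y_band.iter_mut().enumerate() {
                    let row = &w_band[r * cols..(r + 1) * cols];
                    // Scalar dot product; an AVX2 kernel like the sketch
                    // earlier would slot in here.
                    *out = row.iter().zip(x).map(|(a, b)| a * b).sum();
                }
            });
        }
    });
    y
}

fn main() {
    let (rows, cols) = (8, 3);
    let w: Vec<f32> = (0..rows * cols).map(|v| v as f32).collect();
    let x = vec![1.0f32; cols];
    // Each output element is the sum of one row of W.
    println!("{:?}", matvec_parallel(&w, rows, cols, &x, 4));
}
```

Because each thread owns a disjoint `chunks_mut` band of the output, no locks or atomics are needed; the threads share only read-only views of `w` and `x`.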