From 25e3e12d9d5f4941840f58984f161d6827935d17 Mon Sep 17 00:00:00 2001
From: Mikko Juola
Date: Sat, 18 Mar 2023 09:52:09 -0700
Subject: [PATCH] Update README.md on LLaMA-65B benchmark result.

---
 README.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index f0dddaa..fb94f5e 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,9 @@ RLLaMA is a pure Rust implementation of [LLaMA large language model inference.](

 ## Supported features

-  * Use either `f16` and `f32` weights.
-  * LLaMA-7B, LLaMA-13B and LLaMA-30B are all confirmed working. LLaMA-65B
-    likely works but I haven't found a big enough computer to run it.
-  * Multithreaded hand-optimized CPU inference
+  * Uses either `f16` or `f32` weights.
+  * LLaMA-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B all confirmed working
+  * Hand-optimized AVX2 implementation
   * OpenCL support for GPU inference.

 ## Performance
@@ -22,6 +21,7 @@ LLaMA-7B:  AMD Ryzen 3950X: 1008ms / token     f32    (pure
 LLaMA-13B: AMD Ryzen 3950X: 1029ms / token     f16    (pure Rust)
 LLaMA-13B: AMD Ryzen 3950X: 1930ms / token     f32    (pure Rust)
 LLaMA-30B: AMD Ryzen 5950X: 2112ms / token     f16    (pure Rust)
+LLaMA-65B: AMD Ryzen 5950X: 4186ms / token     f16    (pure Rust)

 OpenCL (all use f16):

@@ -181,10 +181,13 @@ LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X:          4098ms / token
 # I've been focusing on making the ordinary non-OpenCL CPU implementation
 # faster and I got some gains, most importantly from multithreading.
 # There is Float16 support now, so I've added f16/f32 to these tables:
+#
+# I also managed to run LLaMA-65B for the first time.

 LLaMA-7B:  AMD Ryzen 3950X: 552ms / token     f16
 LLaMA-7B:  AMD Ryzen 3950X: 1008ms / token    f32
 LLaMA-13B: AMD Ryzen 3950X: 1029ms / token    f16
 LLaMA-13B: AMD Ryzen 3950X: 1930ms / token    f32
 LLaMA-30B: AMD Ryzen 5950X: 2112ms / token    f16
+LLaMA-65B: AMD Ryzen 5950X: 4186ms / token    f16
 ```
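
For context on the "Hand-optimized AVX2 implementation" and f16/f32 weights mentioned in the patched feature list: the sketch below is not RLLaMA's actual kernel, only an illustration of the kind of AVX2/F16C dot product such a pure-Rust CPU path typically relies on. The function name `dot_f16_f32`, the storage of f16 weights as raw `u16` bit patterns, and the f32 activation layout are assumptions made for this example.

```rust
// Illustrative sketch only -- not RLLaMA's actual code. It shows the general
// shape of an AVX2/F16C dot-product kernel over f16 weights: widen 8 packed
// half-precision weights to f32, then accumulate with fused multiply-add.
#[cfg(target_arch = "x86_64")]
mod avx2_sketch {
    use std::arch::x86_64::*;

    /// Dot product of f16 weights (stored as raw u16 bit patterns) against
    /// f32 activations. Caller must ensure the CPU supports avx2, fma and
    /// f16c, and that the slices are equal in length and a multiple of 8.
    #[target_feature(enable = "avx2,fma,f16c")]
    pub unsafe fn dot_f16_f32(weights: &[u16], activations: &[f32]) -> f32 {
        assert_eq!(weights.len(), activations.len());
        assert_eq!(weights.len() % 8, 0);
        let mut acc = _mm256_setzero_ps();
        for i in (0..weights.len()).step_by(8) {
            // Load 8 packed f16 values and widen them to 8 f32 lanes (F16C).
            let w_half = _mm_loadu_si128(weights.as_ptr().add(i) as *const __m128i);
            let w = _mm256_cvtph_ps(w_half);
            let x = _mm256_loadu_ps(activations.as_ptr().add(i));
            // acc += w * x across all 8 lanes in one fused multiply-add.
            acc = _mm256_fmadd_ps(w, x, acc);
        }
        // Horizontal sum of the 8 accumulator lanes down to a single f32.
        let hi = _mm256_extractf128_ps::<1>(acc);
        let lo = _mm256_castps256_ps128(acc);
        let sum4 = _mm_add_ps(hi, lo);
        let sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
        let sum1 = _mm_add_ss(sum2, _mm_shuffle_ps::<0b01>(sum2, sum2));
        _mm_cvtss_f32(sum1)
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2")
            && is_x86_feature_detected!("fma")
            && is_x86_feature_detected!("f16c")
        {
            // Eight weights of 1.0 (f16 bit pattern 0x3C00) times activations 0..8.
            let weights = [0x3C00u16; 8];
            let activations: Vec<f32> = (0..8).map(|i| i as f32).collect();
            let dot = unsafe { avx2_sketch::dot_f16_f32(&weights, &activations) };
            println!("dot = {dot}"); // expected: 0 + 1 + ... + 7 = 28
        }
    }
}
```

Runtime feature detection plus a `#[target_feature]` function is the usual way to ship a single binary that uses AVX2/FMA/F16C only where the host CPU supports them; a real inference loop would call a kernel like this once per output row of each weight matrix.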