From 09f76dfcfa3af3271d0fbc391c58640ee3551dd2 Mon Sep 17 00:00:00 2001
From: Mikko Juola
Date: Wed, 15 Mar 2023 12:47:32 -0700
Subject: [PATCH] Update README.md opening with new benchmark numbers.

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index eea3f95..030f99e 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ This is my attempt at making the LLaMA language model working on a pure Rust
 CPU implementation. I was inspired by an amazing CPU implementation here:
 https://github.com/ggerganov/ggml that could run GPT-J 6B models.
 
-With my crappy OpenCL, this will do around ~270ms on my GTX 3090 per token.
+With my crappy OpenCL, this will do around ~240ms on my GTX 3090 per token.
 With pure CPU on Ryzen 3950X and OpenCL, I can get around 700ms per token.
 And without any OpenCL, pure Rust code only, with some of my handwritten AVX2
 intrinsics, about 1 second per token. All on LLaMA-7B.
@@ -14,8 +14,8 @@ intrinsics, about 1 second per token. All on LLaMA-7B.
 I've also managed to run LLaMA-13B which just barely fits in my 64-gig machine
 with 32-bit float weights everywhere.
 
-LLaMA-30B technically runs but my computer does not have enough memory to keep
-all the weights around so generating a token takes minutes.
+I've managed to run LLaMA-30B on a 128 gigabyte server and it gets around 4
+seconds per token using CPU OpenCL for Ryzen 5950X.
 
 I have not tried LLaMA-60B but presumably if all the smaller models work it
 would run given a sufficiently chonky computer.