From a2e88c1193078976b4e5f27d181b4a54afd0befa Mon Sep 17 00:00:00 2001
From: Mikko Juola
Date: Mon, 13 Mar 2023 21:42:03 -0700
Subject: [PATCH] Update README.md

---
 README.md | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index f399f56..98ceb36 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,14 @@
 This is my attempt at making the LLaMA language model work on a pure Rust
 CPU implementation. I was inspired by an amazing CPU implementation here:
-https://github.com/ggerganov/ggml that could run GPT-J 8B models.
+https://github.com/ggerganov/ggml that could run GPT-J 6B models.
 
-As of writing of this, this can run LLaMA-7B at around ~1 token per second, on
-a Ryzen 3950X using something like 1.5 threads because I haven't yet properly
-figured out how to multithread this.
+With my crappy OpenCL, this does around 270ms per token on my RTX 3090.
+With OpenCL on the CPU (a Ryzen 3950X), I get around 700ms per token. And
+without any OpenCL, pure Rust code only with some of my handwritten AVX2
+intrinsics, about 1 second per token. All on LLaMA-7B.
+
+(Scroll to the bottom to see some benchmarks)
 
 I've also managed to run LLaMA-13B which just barely fits in my 64-gig machine
 with 32-bit float weights everywhere.
 
@@ -17,9 +20,6 @@ all the weights around so generating a token takes minutes.
 
 I have not tried LLaMA-60B but presumably if all the smaller models work it
 would run given a sufficiently chonky computer.
 
-This uses AVX2 intrinsics to speed up itself. Therefore, you need an x86-family
-CPU to run this.
-
 It also has a Python unpickler that understands the `.pth` files used by
 PyTorch. Well almost, it doesn't unzip them automatically (see below).
 
@@ -27,7 +27,7 @@ PyTorch. Well almost, it doesn't unzip them automatically (see below).
 
 You will need Rust. Make sure you can run `cargo` from a command line. In
 particular, this is using unstable features so you need nightly Rust. Make sure
-if you write `cargo --version` it is nightly.
+that if you write `cargo --version` it shows that it is nightly Rust.
 
 You will need to download LLaMA-7B weights. Refer to
 https://github.com/facebookresearch/llama/
 
@@ -50,31 +50,34 @@ cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /
 ```
 
 Right now it seems to use around ~25 gigabytes of memory for 7B and around ~50
-gigabytes for 13B. If you don't use OpenCL, then all parameters are cast to
-32-bit floats.
+gigabytes for 13B. If you don't use OpenCL, then internally all parameters are
+cast to 32-bit floats.
 
 You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler
 settings.
 
-# Future plans
+# Notes and future plans
 
 This is a hobby thing for me so don't expect updates or help.
 
 * Some other CPU implementations use quantization to reduce the size of weights
+  and generally speed up everything a lot.
 * Put some of the operations on the OpenCL GPU/CPU. I've made some initial
-  OpenCL code but it is not used in the transformer loop yet. The CPU OpenCL
-  improves my own AVX2 code by like 100% and massively so on GPU although I am
-  also like 20x slower than equivalent operation on PyTorch on the same GPU.
+  OpenCL code for matrix multiplications but the performance is not competitive
+  with frameworks like PyTorch on GPU.
 * I've heard there is something called Tensor Cores on nVidia GPUs. Not
   accessible with OpenCL. But might be accessible on Vulkan with an extension.
 * More sophisticated token sampling. I saw on Hackernews some comments how the
-  samplers are kinda garbage and you can get much better results with good
-  defaults and things like repetition penalty.
+  samplers included in Facebook's reference code are kinda garbage and you can
+  get much better results with good defaults and things like repetition
+  penalty.
 * There is an initial start-up time as the program has to pass through the
   initial prompt. I don't know if this start-up time can be eliminated
   completely but it could be cached on disk. Use cases like having a standard
   prompt to prime the text generation that you reuse many times.
+* Stanford released an instruct-finetuned LLaMA-7B; once I find the weights,
+  I'd like to try to make a chat-like command-line interface.
 
 # Benchmarks
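
The patch above references the `--temperature`, `--top-p` and `--top-k` sampler flags and floats a repetition penalty as future work. Purely as an illustrative sketch, not the project's actual sampler, here is what a combined temperature / top-k / top-p sampler with a repetition penalty can look like in Rust. The `SamplerSettings` struct and `sample_token` function are hypothetical names, the `rand` crate is assumed, and `temperature` and `top_p` are assumed to be greater than zero:

```rust
use rand::Rng; // assumes the `rand` crate (e.g. rand = "0.8") in Cargo.toml

/// Hypothetical bundle of the sampler settings exposed on the command line.
struct SamplerSettings {
    temperature: f32,        // > 0.0; lower means greedier
    top_k: usize,            // keep only the k most likely tokens
    top_p: f32,              // in (0.0, 1.0]; nucleus sampling cutoff
    repetition_penalty: f32, // > 1.0 discourages recent tokens, 1.0 disables
}

/// Pick a token id from raw logits, given the ids generated recently.
fn sample_token(logits: &[f32], recent_tokens: &[usize], s: &SamplerSettings) -> usize {
    let mut logits = logits.to_vec();

    // Repetition penalty: shrink the logits of recently generated tokens.
    for &tok in recent_tokens {
        if logits[tok] > 0.0 {
            logits[tok] /= s.repetition_penalty;
        } else {
            logits[tok] *= s.repetition_penalty;
        }
    }

    // Temperature scaling followed by a numerically stable softmax.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, ((l - max) / s.temperature).exp()))
        .collect();
    let sum: f32 = probs.iter().map(|&(_, p)| p).sum();
    for p in probs.iter_mut() {
        p.1 /= sum;
    }

    // Top-k: keep the k most probable tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(s.top_k.max(1));

    // Top-p (nucleus): drop the tail once cumulative probability reaches top_p.
    let mut cumulative = 0.0f32;
    probs.retain(|&(_, p)| {
        let keep = cumulative < s.top_p;
        cumulative += p;
        keep
    });

    // Draw from the surviving tokens, weighting by their remaining mass.
    let total: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut r = rand::thread_rng().gen_range(0.0..total);
    for &(tok, p) in &probs {
        if r < p {
            return tok;
        }
        r -= p;
    }
    probs.last().map(|&(tok, _)| tok).unwrap_or(0)
}

fn main() {
    let settings = SamplerSettings {
        temperature: 0.8,
        top_k: 40,
        top_p: 0.95,
        repetition_penalty: 1.1,
    };
    // Toy vocabulary of five "tokens" with made-up logits; token 1 was just emitted.
    let logits = [1.0f32, 2.5, 0.3, -1.0, 2.4];
    println!("sampled token id: {}", sample_token(&logits, &[1], &settings));
}
```

The repetition penalty here follows a common convention: a factor above 1.0 divides positive logits and multiplies negative ones, which pushes down tokens that were generated recently.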
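The notes also mention quantization as a way to shrink the weights and speed everything up. As a rough sketch of the idea, again not code from this repository, the absmax scheme below stores one `f32` scale per weight row plus `i8` values, cutting memory from 4 bytes per weight to roughly 1 byte; `QuantizedRow`, `quantize_row` and `dot_quantized` are made-up names for illustration:

```rust
// A sketch (not project code) of absmax-style 8-bit weight quantization:
// store i8 values plus one f32 scale per row, dequantize on the fly.
struct QuantizedRow {
    scale: f32,      // original value is approximately q as f32 * scale
    values: Vec<i8>, // quantized weights in [-127, 127]
}

fn quantize_row(row: &[f32]) -> QuantizedRow {
    let absmax = row.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let values = row.iter().map(|&x| (x / scale).round() as i8).collect();
    QuantizedRow { scale, values }
}

// Dot product against an f32 activation vector, dequantizing as we go.
fn dot_quantized(row: &QuantizedRow, x: &[f32]) -> f32 {
    row.values
        .iter()
        .zip(x)
        .map(|(&q, &xi)| q as f32 * row.scale * xi)
        .sum()
}

fn main() {
    let weights = vec![0.12f32, -0.7, 0.33, 0.01, -0.95];
    let activations = vec![1.0f32, 0.5, -0.25, 2.0, 0.1];
    let q = quantize_row(&weights);
    println!("quantized dot = {}", dot_quantized(&q, &activations));
}
```

Besides the memory saving, the smaller weights also ease the memory-bandwidth pressure that tends to dominate CPU token generation, which is where the speed-up the notes mention comes from.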
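Finally, the benchmark figures above credit hand-written AVX2 intrinsics for the pure-Rust numbers. The sketch below, which is not the repository's implementation, shows the general shape of such an inner loop: an AVX2/FMA dot product over `f32` slices built on `std::arch`, x86-64 only:

```rust
// A sketch of an AVX2/FMA dot product (x86-64 only), in the spirit of the
// hand-written intrinsics the README mentions. Not the repository's actual code.
use std::arch::x86_64::*;

/// Dot product of two equally long f32 slices using 8-wide AVX2 lanes.
///
/// Safety: the caller must have verified that the CPU supports AVX2 and FMA.
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, fused
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths that are not a multiple of 8.
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a: Vec<f32> = (0..4096).map(|i| (i % 7) as f32).collect();
    let b: Vec<f32> = (0..4096).map(|i| (i % 5) as f32).collect();
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // SAFETY: the required CPU features were just checked.
        let d = unsafe { dot_avx2(&a, &b) };
        println!("dot product = {d}");
    } else {
        println!("AVX2/FMA not available on this CPU");
    }
}
```

Runtime detection via `is_x86_feature_detected!` keeps a fallback path possible on CPUs without AVX2, instead of requiring an x86-family CPU outright.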