Rust+OpenCL+AVX2 implementation of LLaMA inference code

RLLaMA

This is my attempt at getting the LLaMA language model to work in a pure Rust CPU implementation. I was inspired by an amazing CPU implementation here: https://github.com/ggerganov/ggml that can run GPT-J-6B models.

As of this writing, it can run LLaMA-7B at around 1 token per second on a Ryzen 3950X, using the equivalent of about 1.5 threads, because I haven't yet properly figured out how to multithread this.

I've also managed to run LLaMA-13B, which just barely fits in my 64-gigabyte machine with 32-bit float weights everywhere.

LLaMA-30B technically runs, but my computer does not have enough memory to keep all the weights around, so generating a single token takes minutes.

I have not tried LLaMA-65B, but presumably, if all the smaller models work, it would run too given a sufficiently chonky computer.

This uses AVX2 intrinsics to speed itself up, so you need an x86-family CPU with AVX2 support to run it.
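For illustration only (this is not the repository's actual kernel; the function and its assumptions are made up for this sketch), a 256-bit SIMD dot product in Rust looks roughly like this:

use std::arch::x86_64::*;

// Sketch of a SIMD dot product over f32 slices.
// Assumes both slices have the same length, that the length is a
// multiple of 8, and that the CPU actually supports AVX.
#[target_feature(enable = "avx")]
unsafe fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for i in (0..a.len()).step_by(8) {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Horizontal sum of the eight lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}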

It also includes a Python unpickler that understands the .pth files used by PyTorch. Well, almost: it doesn't unzip them automatically (see below).

How to run

You will need Rust. Make sure you can run cargo from the command line. In particular, this project uses unstable features, so you need nightly Rust; check that cargo --version reports a nightly toolchain.
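If you use rustup, something along these lines should set that up (assuming rustup is already installed; adjust to your own setup):

$ rustup toolchain install nightly
$ rustup override set nightly   # use nightly only inside this project's directory
$ cargo --version               # should now report a nightly version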

You will need to download LLaMA-7B weights. Refer to https://github.com/facebookresearch/llama/

Once you have the 7B weights and the tokenizer.model file that comes with them, you need to decompress the checkpoint.

$ cd LLaMA
$ cd 7B
$ unzip consolidated.00.pth
# For LLaMA-7B, rename consolidated to consolidated.00
# For the larger models, the number is there already so no need to do this step.
$ mv consolidated consolidated.00

You should then be ready to generate some text.

cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"

Right now it seems to use around 25 gigabytes of memory for 7B and around 50 gigabytes for 13B. Internally, all weights are cast to 32-bit floats.
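As a rough sanity check (my back-of-the-envelope arithmetic, not a measured breakdown), the weights alone account for most of that:

7B parameters × 4 bytes per 32-bit float ≈ 28 GB
13B parameters × 4 bytes per 32-bit float ≈ 52 GB

which is in the same ballpark as the observed figures.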

You can use --temperature, --top-p and --top-k to adjust token sampler settings.
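For example (the specific values here are just illustrative, not tuned recommendations):

cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is" --temperature 0.8 --top-p 0.95 --top-k 40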

Future plans

This is a hobby thing for me so don't expect updates or help.

  • Some other CPU implementations use quantization to reduce the size of the weights; I could do the same here.
  • Put some of the operations on OpenCL (GPU or CPU). I've written some initial OpenCL code, but it is not used in the transformer loop yet. The OpenCL CPU path beats my own AVX2 code by something like 100%, and the GPU path by much more, although I am still around 20x slower than the equivalent operation in PyTorch on the same GPU.
  • I've heard there is something called Tensor Cores on NVIDIA GPUs. They are not accessible from OpenCL, but they might be accessible from Vulkan with an extension.
  • More sophisticated token sampling. I saw some comments on Hacker News that the usual samplers are kinda garbage and that you can get much better results with good defaults and things like a repetition penalty (see the sketch after this list).
  • There is an initial start-up cost because the program has to pass the whole prompt through the model before generating. I don't know if this start-up time can be eliminated completely, but the result could be cached on disk; that would help use cases where you reuse a standard prompt many times to prime the text generation.
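To make the repetition-penalty idea above concrete, here is a rough sketch (not code from this repository; the function name and types are hypothetical):

fn apply_repetition_penalty(logits: &mut [f32], generated_tokens: &[usize], penalty: f32) {
    // Dampen the logits of tokens that have already been generated so the
    // sampler is less inclined to repeat itself. penalty > 1.0, e.g. 1.2.
    for &tok in generated_tokens {
        if let Some(logit) = logits.get_mut(tok) {
            if *logit > 0.0 {
                *logit /= penalty;
            } else {
                *logit *= penalty;
            }
        }
    }
}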