Rust+OpenCL+AVX2 implementation of LLaMA inference code

AdeonLLaMA

This is my attempt at making the LLaMA language model work in a pure Rust CPU implementation. I was inspired by an amazing CPU implementation here: https://github.com/ggerganov/ggml that can run GPT-J 6B models.

As of this writing, it can run LLaMA-7B at around ~1 token per second on a Ryzen 3950X, using something like 1.5 threads because I haven't yet properly figured out how to multithread this.

It uses AVX2 intrinsics to speed itself up, so you need an x86-family CPU with AVX2 support to run this.
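
To give an idea of what that looks like, here is a minimal sketch of the kind of AVX2 dot-product kernel a CPU transformer implementation leans on for its matrix multiplications. It is an illustrative example (the function name and layout are mine), not code taken from this repository.

use std::arch::x86_64::*;

// Dot product of two f32 slices using AVX2 + FMA intrinsics.
// Safety: the caller must ensure the CPU supports AVX2 and FMA.
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        // acc += a[i*8..i*8+8] * b[i*8..i*8+8], eight floats at a time.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths that are not a multiple of 8.
    for i in (chunks * 8)..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a: Vec<f32> = (0..32).map(|x| x as f32).collect();
    let b = vec![2.0f32; 32];
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        println!("dot = {}", unsafe { dot_avx2(&a, &b) });
    }
}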

It has a Python unpickler that understands the .pth files used by PyTorch. Well, sort of: it doesn't unzip them automatically (see below).

How to run

You will need Rust. Make sure you can run cargo from a command line.

You will need to download LLaMA-7B weights. Refer to https://github.com/facebookresearch/llama/

Once you have the 7B weights and the tokenizer.model that comes with them, you need to decompress the weights file.

$ cd LLaMA
$ cd 7B
$ unzip consolidated.00.pth

You should then be ready to generate some text.

cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B/consolidated/data.pkl --prompt "The meaning of life is"

Right now it seems to use around 25 gigabytes of memory. Internally, all weights are cast to 32-bit floats (7 billion parameters at 4 bytes each comes to roughly 26 GiB).

You can use --temperature, --top-p and --top-k to adjust token sampler settings.
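
For context on what those flags control, here is a hedged sketch of a temperature / top-k / top-p sampler. The function name and structure are made up for illustration and are not taken from this project's source; a real implementation would draw u from an RNG rather than take it as a parameter.

// Pick a token id from raw logits, applying temperature, top-k and
// top-p (nucleus) filtering. `u` is a uniform random number in [0, 1).
fn sample_token(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, u: f32) -> usize {
    // Temperature: divide logits before softmax; lower values are greedier.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Softmax (subtract the max for numerical stability).
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let total: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, &e)| (i, e / total)).collect();

    // Top-k: keep only the k most likely tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k.max(1));

    // Top-p: keep the smallest prefix whose probability mass reaches top_p.
    let mut cumulative = 0.0;
    let mut cutoff = probs.len();
    for (i, &(_, p)) in probs.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    probs.truncate(cutoff);

    // Draw from the remaining (renormalized) distribution.
    let mass: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut threshold = u * mass;
    for &(id, p) in &probs {
        if threshold < p {
            return id;
        }
        threshold -= p;
    }
    probs.last().unwrap().0
}

fn main() {
    let logits = vec![2.0, 1.0, 0.5, -1.0];
    println!("sampled token id: {}", sample_token(&logits, 0.8, 3, 0.9, 0.42));
}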

Future plans

This is a hobby thing for me, so don't expect updates or help.