Rust+OpenCL+AVX2 implementation of LLaMA inference code

Mikko Juola f6217e0036 Add readme, make clippy happy.		3 years ago
proto	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago
src	Add readme, make clippy happy.	3 years ago
.gitignore	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago
Cargo.lock	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago
Cargo.toml	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago
LICENSE	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago
LICENSE.third_parties	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago
README.md	Add readme, make clippy happy.	3 years ago
build.rs	First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.	3 years ago

README.md

AdeonLLaMA

This is my attempt at getting the LLaMA language model working on a pure Rust CPU implementation. I was inspired by an amazing CPU implementation here: https://github.com/ggerganov/ggml that could run GPT-J 6B models.

As of this writing, this can run LLaMA-7B at around 1 token per second, using something like 1.5 threads because I haven't yet properly figured out how to multithread this.

It uses AVX2 intrinsics to speed itself up. Therefore, you need an x86-family CPU with AVX2 support to run this.
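To give a feel for what AVX2-style intrinsics look like in Rust, here is a minimal sketch of a dot product over `f32` slices. It is not taken from this repository: the function names `dot`, `dot_avx2` and `dot_scalar` are made up for illustration, and the runtime check with a scalar fallback is an assumption about how one might structure such code. The 256-bit float load, multiply and add intrinsics used here are technically AVX instructions, which every AVX2-capable CPU also supports.

```rust
/// Plain scalar dot product, used as the fallback and as a reference.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hypothetical 8-lane dot product using 256-bit registers.
/// The caller must make sure the CPU supports AVX2 before calling this.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;

    let n = a.len().min(b.len());
    let chunks = n / 8;

    // Accumulate 8 partial sums at once.
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Unaligned loads, so the slices need no special alignment.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }

    // Spill the 8 lanes to memory and add them up.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();

    // Handle the leftover elements that do not fill a full register.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Picks the vectorized path when the CPU supports it, else the scalar one.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: we just checked that AVX2 is available.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Calling `dot(&a, &b)` on two slices of equal length returns their dot product. Note that the vectorized path sums in a different order than the scalar one, so results can differ by small floating-point rounding on general inputs.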

It has a Python unpickler that understands the .pth files used by PyTorch. Well, sort of: it doesn't unzip them automatically (see below).
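Since the .pth file is a zip archive wrapped around a pickle stream, it can help to tell the two apart before feeding a file to an unpickler. The following is a minimal sketch, not code from this repository: `CheckpointKind`, `classify_header` and `classify_file` are hypothetical names. It relies only on two well-known facts: zip archives normally start with the bytes `PK\x03\x04`, and pickle streams using protocol 2 or later start with the `PROTO` opcode `0x80` followed by the protocol number.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// What the first bytes of a file suggest about its contents.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckpointKind {
    /// Looks like a zip archive, e.g. a .pth file that still needs unzipping.
    ZipArchive,
    /// Looks like a raw pickle stream, e.g. an extracted data.pkl.
    RawPickle,
    /// Neither signature matched.
    Unknown,
}

/// Classifies a file by its leading bytes (hypothetical helper).
pub fn classify_header(header: &[u8]) -> CheckpointKind {
    if header.starts_with(b"PK\x03\x04") {
        // Local file header signature of a zip archive.
        CheckpointKind::ZipArchive
    } else if header.len() >= 2 && header[0] == 0x80 && (2..=5).contains(&header[1]) {
        // PROTO opcode followed by a pickle protocol version between 2 and 5.
        CheckpointKind::RawPickle
    } else {
        CheckpointKind::Unknown
    }
}

/// Reads the first few bytes of `path` and classifies them.
pub fn classify_file(path: &Path) -> io::Result<CheckpointKind> {
    let mut header = [0u8; 4];
    let n = File::open(path)?.read(&mut header)?;
    Ok(classify_header(&header[..n]))
}
```

With a check like this, a loader could print a hint such as "please unzip this file first" when it is handed a zipped .pth file instead of the extracted pickle.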

How to run

You will need the LLaMA-7B weights first. Refer to https://github.com/facebookresearch/llama/

Once you have the 7B weights and the tokenizer.model file that comes with them, you need to unzip the checkpoint file.

$ cd LLaMA
$ cd 7B
$ unzip consolidated.00.pth

You should then be ready to generate some text.

$ cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B/consolidated/data.pkl --prompt "The meaning of life is"

Right now it seems to use around 25 gigabytes of memory. Internally, all weights are cast to 32-bit floats.
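A rough back-of-the-envelope calculation shows why the number lands in that range. The sketch below is illustrative only: the helper names are made up, and the parameter count of about 6.7 billion for LLaMA-7B is an assumption (an approximate figure), not something measured from this code. It also counts weights only, ignoring activations and other buffers.

```rust
/// Approximate parameter count of LLaMA-7B (assumed, rounded figure).
pub const ASSUMED_7B_PARAMS: u64 = 6_700_000_000;

/// Bytes needed to hold `param_count` weights at `bytes_per_param` each.
pub fn weights_bytes(param_count: u64, bytes_per_param: u64) -> u64 {
    param_count * bytes_per_param
}

/// Converts a byte count to gibibytes (2^30 bytes).
pub fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}
```

Under these assumptions, `weights_bytes(ASSUMED_7B_PARAMS, 4)` gives 26.8 billion bytes for 32-bit floats, which is just under 25 GiB. The same weights kept as 16-bit floats would need half of that.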

Future plans

This is a hobby project for me, so don't expect updates or help.