# AdeonLLaMA
This is my attempt at getting the LLaMA language model working as a pure Rust CPU implementation.
As of this writing, it can run LLaMA-7B at around 1 token per second, using something like 1.5 threads, because I haven't yet properly figured out how to multithread it.
It uses AVX2 intrinsics to speed itself up.
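For a rough idea of what that means, here is a minimal sketch of an AVX2 dot product in Rust, the kind of inner-loop kernel a CPU transformer implementation leans on. The function name and layout are illustrative assumptions, not code taken from this repository:

```rust
// Illustrative sketch: an AVX2 + FMA dot product with a scalar tail.
// Assumes an x86_64 CPU; not code from this repository.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        // Load 8 f32s from each slice and fused-multiply-add into the accumulator.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), acc);
    let mut sum: f32 = buf.iter().sum();
    // Handle the leftover elements that don't fill a full 8-wide register.
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let a = vec![1.0f32; 19];
        let b = vec![2.0f32; 19];
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: we just verified that AVX2 and FMA are available.
            let d = unsafe { dot_avx2(&a, &b) };
            println!("dot = {d}"); // 19 * (1.0 * 2.0) = 38
        }
    }
}
```

Runtime feature detection keeps the call safe on CPUs that lack AVX2, while `#[target_feature]` lets the compiler emit the vector instructions inside the kernel.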
## How to run
You will need the LLaMA-7B weights first. Refer to https://github.com/facebookresearch/llama/
Once you have the 7B weights and the tokenizer.model that ships with them, you can make it generate tokens:
```sh
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B
```