# AdeonLLaMA
This is my attempt at getting the LLaMA language model working as a pure Rust CPU implementation.
As of this writing, it can run LLaMA-7B at around 1 token per second, using something like 1.5 threads, because I haven't yet properly figured out how to multithread it.
It uses AVX2 intrinsics to speed itself up.
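For a rough idea of what that means, here is a minimal sketch of an AVX2 dot product in Rust, the kind of inner-loop kernel a CPU transformer implementation leans on. The function name and layout are illustrative assumptions, not code taken from this repository:

```rust
// Illustrative sketch: an AVX2 + FMA dot product with a scalar tail.
// Assumes an x86_64 CPU; not code from this repository.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        // Load 8 f32s from each slice and fused-multiply-add into the accumulator.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), acc);
    let mut sum: f32 = buf.iter().sum();
    // Handle the leftover elements that don't fill a full 8-wide register.
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let a = vec![1.0f32; 19];
        let b = vec![2.0f32; 19];
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: we just verified that AVX2 and FMA are available.
            let d = unsafe { dot_avx2(&a, &b) };
            println!("dot = {d}"); // 19 * (1.0 * 2.0) = 38
        }
    }
}
```

Runtime feature detection keeps the call safe on CPUs that lack AVX2, while `#[target_feature]` lets the compiler emit the vector instructions inside the kernel.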
## How to run
You will need the LLaMA-7B weights first. Refer to https://github.com/facebookresearch/llama/
Once you have the 7B weights and the tokenizer.model that ships with them, you can make it generate tokens:
```sh
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B
```