diff --git a/Cargo.lock b/Cargo.lock
index 0d2c2eb..2eb5405 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -898,7 +898,7 @@ checksum = "456c603be3e8d448b072f410900c09faf164fbce2d480456f50eea6e25f9c848"
 
 [[package]]
 name = "rllama"
-version = "0.1.0"
+version = "0.3.0"
 dependencies = [
  "approx",
  "clap 4.1.10",
diff --git a/README.md b/README.md
index 1513d54..3262f86 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,16 @@
 # RLLaMA
 
-This is my attempt at making the LLaMA language model working on a pure Rust
-CPU implementation. I was inspired by an amazing CPU implementation here:
-https://github.com/ggerganov/ggml that could run GPT-J 6B models.
+RLLaMA is a pure Rust implementation of [LLaMA large language model inference](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/).
+
+## Supported features
+
+ * Use either `f16` or `f32` weights.
+ * LLaMA-7B, LLaMA-13B and LLaMA-30B are all confirmed working. LLaMA-65B
+   likely works but I haven't found a big enough computer to run it.
+ * Multithreaded hand-optimized CPU inference
+ * OpenCL support for GPU inference.
+
+## Performance
 
 The current performance is as follows:
 
@@ -24,20 +32,16 @@ LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token (Open
 LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token (OpenCL on CPU)
 ```
 
-(Scroll to the bottom to see benchmarks over time).
-
-I have not tried to run LLaMA-60B but I think it would work if you got a big
-enough computer.
+Scroll to the bottom of this README.md to see benchmarks over time.
 
-It also has a Python unpickler that understands the `.pth` files used by
-PyTorch. Well almost, it doesn't unzip them automatically (see below).
+## Screenshot
 
-The implementation uses AVX2, even in the OpenCL codepath, so this will only
-run on AMD64 at this time.
+![Screenshot of RLLaMA in action](rllama.gif)
 
-# Crates.io Cargo package install
+## Install
 
-As of March 18, `rllama` is on `crates.io`. You can install it with `cargo install rllama`. You may need to explicitly enable AVX2 features:
+You can install with the `cargo` tool. RLLaMA uses intrinsics extensively and
+you likely need to enable them to install the executable.
 
 ```
 RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
@@ -46,55 +50,60 @@ RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
 There is a `.cargo/config.toml` inside this repository that will enable these
 features if you install manually from this Git repository instead.
 
-# How to run
+## LLaMA weights
 
-You will need Rust. Make sure you can run `cargo` from a command line. In
-particular, this is using unstable features so you need nightly rust. Make sure
-that if you write `cargo --version` it shows that it is nightly Rust.
+Refer to https://github.com/facebookresearch/llama/ for the weights. As of now,
+you need to be approved to get them.
 
-You will need to download LLaMA-7B weights. Refer to https://github.com/facebookresearch/llama/
+For LLaMA-7B, make sure you have these files:
 
-Once you have 7B weights, and the `tokenizer.model` it comes with, you need to
-decompress it.
+```shell
+* 7B/consolidated.00.pth
+* 7B/params.json
+* tokenizer.model
+```
+
+The `consolidated.00.pth` file is actually a zip archive. You need to unzip it:
 
 ```shell
-$ cd LLaMA
 $ cd 7B
 $ unzip consolidated.00.pth
-# For LLaMA-7B, rename consolidated to consolidated.00
-# For the larger models, the number is there already so no need to do this step.
 $ mv consolidated consolidated.00
 ```
 
-You should then be ready to generate some text.
+If you are using a larger model like LLaMA-13B, then you can skip the last step
+of renaming the `consolidated` directory.
 
-```shell
-cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
-```
+You should now be ready to generate some text.
 
-By default, it will use the weights in the precision they are in the source
-files. You can use `--f16` command line argument to cast the largest weight
-matrices to float16. Also, using OpenCL will also cast the weight matrices to
-float16.
+## Example
 
-You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler
-settings.
+Run LLaMA-7B with some weights cast to 16-bit floats:
 
-There is `--repetition-penalty` setting. 1.0 means no penalty. This value
-likely should be between 0 and 1. Values smaller than 1.0 give a penalty to
-tokens that appear in the context, by
-`x*(repetitition_penalty^num_occurrences)` before applying `softmax()` on the
-output probabilities. Or in other words, values smaller than 1.0 apply penalty.
+```shell
+rllama --tokenizer-model /path/to/tokenizer.model \
+       --model-path /path/to/LLaMA/7B \
+       --param-path /path/to/LLaMA/7B/params.json \
+       --f16 \
+       --prompt "The meaning of life is"
+```
 
-You can also use `--prompt-file` to read the prompt from a file instead from
-the command line.
+Use `rllama --help` to see all the options.
 
-# How to turn on OpenCL
+## How to turn on OpenCL
 
 Use `opencl` Cargo feature.
 
 ```
-cargo run --release --features opencl -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
+RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl
+```
+
+```
+rllama --tokenizer-model /path/to/tokenizer.model \
+       --model-path /path/to/LLaMA/7B \
+       --param-path /path/to/LLaMA/7B/params.json \
+       --opencl-device 0 \
+       --prompt "The meaning of life is"
 ```
 
 With `opencl` feature, there is also another argument, `--opencl-device` that
@@ -102,11 +111,9 @@ takes a number. That number selects Nth OpenCL device found on the system.
 You can see the devices in the output when you run the program (e.g. see the
 screenshot below).
 
-# Screenshot
-
-![Screenshot of RLLaMA in action](rllama.png)
+Weights are always cast to 16-bit floats for OpenCL.
 
-# Notes and future plans
+## Notes and future plans
 
 This is a hobby thing for me so don't expect updates or help.
 
@@ -126,7 +133,7 @@ This is a hobby thing for me so don't expect updates or help.
 
 * Stanford released some instruct-finetuned LLaMA-7B, once I find the weights
   then I'd like to try make a chat-like command-line interface.
-# Benchmarks
+## Benchmarks
 
 I'm trying to track that I'm making this faster and not slower.
 
diff --git a/rllama.gif b/rllama.gif
new file mode 100644
index 0000000..bf0b75d
Binary files /dev/null and b/rllama.gif differ
diff --git a/rllama.png b/rllama.png
deleted file mode 100644
index 6b6eff0..0000000
Binary files a/rllama.png and /dev/null differ
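Side note on the `RUSTFLAGS` install lines added in the README above: the snippet below is a minimal, standalone Rust sketch (not part of rllama, purely illustrative) that checks at runtime whether the CPU actually provides the features those flags enable at compile time. It only uses the standard library's feature-detection macro.

```rust
// Hypothetical helper, not part of rllama: report whether this machine's CPU
// supports the features that `-C target-feature=+sse2,+avx,+fma,+avx2` bakes
// into the binary at compile time. If any of these print `false`, a binary
// built with those flags may fail with an illegal-instruction error here.
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("sse2: {}", std::arch::is_x86_feature_detected!("sse2"));
        println!("avx:  {}", std::arch::is_x86_feature_detected!("avx"));
        println!("fma:  {}", std::arch::is_x86_feature_detected!("fma"));
        println!("avx2: {}", std::arch::is_x86_feature_detected!("avx2"));
    }
    #[cfg(not(target_arch = "x86_64"))]
    println!("Not an x86-64 CPU; the AVX2 code paths mentioned above are unavailable.");
}
```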