Update README.md, add a nice animation.

master
Mikko Juola 3 years ago
parent cfad4b1205
commit db0f22ed26

Cargo.lock (generated)

@@ -898,7 +898,7 @@ checksum = "456c603be3e8d448b072f410900c09faf164fbce2d480456f50eea6e25f9c848"
 [[package]]
 name = "rllama"
-version = "0.1.0"
+version = "0.3.0"
 dependencies = [
  "approx",
  "clap 4.1.10",

README.md

@@ -1,8 +1,16 @@
 # RLLaMA

-This is my attempt at making the LLaMA language model work in a pure Rust
-CPU implementation. I was inspired by an amazing CPU implementation here:
-https://github.com/ggerganov/ggml that could run GPT-J 6B models.
+RLLaMA is a pure Rust implementation of [LLaMA large language model inference](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/).
+
+## Supported features
+
+* Use either `f16` or `f32` weights.
+* LLaMA-7B, LLaMA-13B and LLaMA-30B are all confirmed working. LLaMA-65B
+  likely works but I haven't found a big enough computer to run it.
+* Multithreaded hand-optimized CPU inference
+* OpenCL support for GPU inference.
+
+## Performance

 The current performance is as follows:
@@ -24,20 +32,16 @@ LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token (OpenCL on CPU)
 LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token (OpenCL on CPU)
 ```

-(Scroll to the bottom to see benchmarks over time).
+Scroll to the bottom of this README.md to see benchmarks over time.

-I have not tried to run LLaMA-60B but I think it would work if you got a big
-enough computer.
-
-It also has a Python unpickler that understands the `.pth` files used by
-PyTorch. Well almost, it doesn't unzip them automatically (see below).
-
-The implementation uses AVX2, even in the OpenCL codepath, so this will only
-run on AMD64 at this time.
+## Screenshot
+
+![Screenshot of RLLaMA in action](rllama.gif)

-# Crates.io Cargo package install
+## Install

-As of March 18, `rllama` is on `crates.io`. You can install it with `cargo install rllama`. You may need to explicitly enable AVX2 features:
+You can install with the `cargo` tool. RLLaMA uses intrinsics extensively and you
+likely need to enable them when installing the executable.
 ```
 RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
 ```
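For a sense of what these target features enable, here is a minimal illustrative sketch of an AVX2+FMA dot-product kernel. It is in the spirit of, but not copied from, rllama's hand-optimized CPU code:

```rust
// Illustrative only: a dot product using AVX2 + FMA intrinsics.
// rllama's real kernels are more elaborate; this just shows why
// +avx2 and +fma must be enabled at compile time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, fused
    }
    // Horizontal sum of the 8 lanes, then the scalar tail.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```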
@@ -46,55 +50,60 @@ RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
 There is a `.cargo/config.toml` inside this repository that will enable these
 features if you install manually from this Git repository instead.
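Such a `.cargo/config.toml` typically looks like the sketch below; the file actually shipped in the repository may differ:

```toml
# Hypothetical example; check the repository's own .cargo/config.toml.
[build]
rustflags = ["-C", "target-feature=+sse2,+avx,+fma,+avx2"]
```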
-# How to run
+## LLaMA weights

-You will need Rust. Make sure you can run `cargo` from a command line. In
-particular, this is using unstable features so you need nightly rust. Make sure
-that if you write `cargo --version` it shows that it is nightly Rust.
-
-You will need to download LLaMA-7B weights. Refer to https://github.com/facebookresearch/llama/
+Refer to https://github.com/facebookresearch/llama/. As of now, you need to be
+approved to get weights.

-Once you have 7B weights, and the `tokenizer.model` it comes with, you need to
-decompress it.
+For LLaMA-7B, make sure you got these files:
+
+```shell
+* 7B/consolidated.00.pth
+* 7B/params.json
+* tokenizer.model
+```
+
+The `consolidated.00.pth` is actually a zip file. You need to unzip it:
 ```shell
-$ cd LLaMA
 $ cd 7B
 $ unzip consolidated.00.pth
-# For LLaMA-7B, rename consolidated to consolidated.00
-# For the larger models, the number is there already so no need to do this step.
 $ mv consolidated consolidated.00
 ```
-You should then be ready to generate some text.
-
-```shell
-cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
-```
+If you are using a larger model like LLaMA-13B, then you can skip the last step
+of renaming the `consolidated` directory.
+
+You should now be ready to generate some text.
-By default, it will use the weights in the precision they are in the source
-files. You can use the `--f16` command line argument to cast the largest weight
-matrices to float16. Also, using OpenCL will cast the weight matrices to
-float16.
-
-You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler
-settings.
-
-There is a `--repetition-penalty` setting. 1.0 means no penalty. This value
-likely should be between 0 and 1. Values smaller than 1.0 give a penalty to
-tokens that appear in the context, by
-`x*(repetition_penalty^num_occurrences)` before applying `softmax()` on the
-output probabilities. In other words, values smaller than 1.0 apply a penalty.
+## Example
+
+Run LLaMA-7B with some weights cast to 16-bit floats:
+
+```shell
+rllama --tokenizer-model /path/to/tokenizer.model \
+    --model-path /path/to/LLaMA/7B \
+    --param-path /path/to/LLaMA/7B/params.json \
+    --f16 \
+    --prompt "The meaning of life is"
+```
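To make the sampler settings described in the removed paragraphs above concrete, here is a minimal illustrative sketch of how knobs like `--temperature`, `--top-k`, `--top-p` and the `x*(repetition_penalty^num_occurrences)` penalty typically act on a model's raw outputs. The names and structure are hypothetical, not rllama's actual code:

```rust
/// Scale each output x by repetition_penalty^num_occurrences, where
/// num_occurrences counts how often that token appears in the context.
fn apply_repetition_penalty(outputs: &mut [f32], context: &[usize], penalty: f32) {
    for (token_id, x) in outputs.iter_mut().enumerate() {
        let n = context.iter().filter(|&&t| t == token_id).count();
        if n > 0 {
            *x *= penalty.powi(n as i32); // penalty < 1.0 suppresses repeats
        }
    }
}

/// Softmax with temperature, then keep only the top-k tokens and the
/// smallest prefix whose cumulative probability reaches top-p.
fn sample(outputs: &[f32], temperature: f32, top_k: usize, top_p: f32) -> usize {
    // Softmax with temperature (higher temperature flattens the distribution).
    let max = outputs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = outputs.iter().map(|&x| ((x - max) / temperature).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Sort token ids by probability, descending.
    let mut ids: Vec<usize> = (0..outputs.len()).collect();
    ids.sort_by(|&a, &b| exps[b].partial_cmp(&exps[a]).unwrap());

    // Truncate to top-k, then to the top-p nucleus.
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for &id in ids.iter().take(top_k) {
        kept.push(id);
        cumulative += exps[id] / total;
        if cumulative >= top_p {
            break;
        }
    }

    // A real sampler would draw randomly from `kept`, weighted by
    // probability; taking the head keeps this sketch dependency-free.
    kept[0]
}
```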
-You can also use `--prompt-file` to read the prompt from a file instead of
-the command line.
+Use `rllama --help` to see all the options.
-# How to turn on OpenCL
+## How to turn on OpenCL

 Use the `opencl` Cargo feature.

-```
-cargo run --release --features opencl -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
-```
+```
+RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl
+```
+
+```
+rllama --tokenizer-model /path/to/tokenizer.model \
+    --model-path /path/to/LLaMA/7B \
+    --param-path /path/to/LLaMA/7B/params.json \
+    --opencl-device 0 \
+    --prompt "The meaning of life is"
+```
 With the `opencl` feature, there is also another argument, `--opencl-device`, that
@@ -102,11 +111,9 @@ takes a number. That number selects Nth OpenCL device found on the system. You
 can see the devices in the output when you run the program (e.g. see the
 screenshot below).

-# Screenshot
-
-![Screenshot of RLLaMA in action](rllama.png)
+Weights are always cast to 16-bit floats for OpenCL.
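As an illustrative aside on what casting weights to 16-bit floats means, here is a minimal sketch using the `half` crate; rllama's internal storage format may well differ:

```rust
// Hypothetical sketch of f32 -> f16 weight casting with the `half`
// crate; not rllama's actual code.
use half::f16;

fn cast_weights_to_f16(weights: &[f32]) -> Vec<f16> {
    weights.iter().map(|&w| f16::from_f32(w)).collect()
}

fn main() {
    let weights = vec![0.1f32, -1.5, 3.25];
    for (w, h) in weights.iter().zip(cast_weights_to_f16(&weights)) {
        // Round-tripping shows the precision lost by the cast.
        println!("{w} -> {}", h.to_f32());
    }
}
```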
-# Notes and future plans
+## Notes and future plans

 This is a hobby thing for me so don't expect updates or help.
@@ -126,7 +133,7 @@ This is a hobby thing for me so don't expect updates or help.
 * Stanford released some instruct-finetuned LLaMA-7B, once I find the weights
   then I'd like to try make a chat-like command-line interface.
-# Benchmarks
+## Benchmarks
 I'm trying to track that I'm making this faster and not slower.

rllama.gif: binary file not shown (added, 843 KiB).

rllama.png: binary file not shown (removed, 484 KiB).