Update README.md, add a nice animation.

master
Mikko Juola 3 years ago
parent cfad4b1205
commit db0f22ed26

Cargo.lock (generated)

@@ -898,7 +898,7 @@ checksum = "456c603be3e8d448b072f410900c09faf164fbce2d480456f50eea6e25f9c848"
 [[package]]
 name = "rllama"
-version = "0.1.0"
+version = "0.3.0"
 dependencies = [
  "approx",
  "clap 4.1.10",

README.md

@@ -1,8 +1,16 @@
 # RLLaMA

-This is my attempt at making the LLaMA language model work in a pure Rust
-CPU implementation. I was inspired by an amazing CPU implementation here:
-https://github.com/ggerganov/ggml that could run GPT-J 6B models.
+RLLaMA is a pure Rust implementation of [LLaMA large language model inference](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/).
+
+## Supported features
+
+* Use either `f16` or `f32` weights.
+* LLaMA-7B, LLaMA-13B and LLaMA-30B are all confirmed working. LLaMA-65B
+  likely works but I haven't found a big enough computer to run it.
+* Multithreaded hand-optimized CPU inference
+* OpenCL support for GPU inference.
+
+## Performance

 The current performance is as follows:
@@ -24,20 +32,16 @@ LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1232ms / token (OpenCL on CPU)
 LLaMA-30B: AMD Ryzen 5950X + OpenCL Ryzen 5950X: 4098ms / token (OpenCL on CPU)
 ```

-(Scroll to the bottom to see benchmarks over time).
+Scroll to the bottom of this README.md to see benchmarks over time.

-I have not tried to run LLaMA-60B but I think it would work if you got a big
-enough computer.
-
-It also has a Python unpickler that understands the `.pth` files used by
-PyTorch. Well almost, it doesn't unzip them automatically (see below).
-
-The implementation uses AVX2, even in the OpenCL codepath, so this will only
-run on AMD64 at this time.
+## Screenshot
+
+![Screenshot of RLLaMA in action](rllama.gif)

-# Crates.io Cargo package install
+## Install

-As of March 18, `rllama` is on `crates.io`. You can install it with `cargo install rllama`. You may need to explicitly enable AVX2 features:
+You can install with the `cargo` tool. RLLaMA uses intrinsics extensively and you
+likely need to enable them when installing the executable.
 ```
 RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
 ```
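For a sense of what these target features enable, here is a minimal illustrative sketch of an AVX2+FMA dot-product kernel. It is in the spirit of, but not copied from, rllama's hand-optimized CPU code:

```rust
// Illustrative only: a dot product using AVX2 + FMA intrinsics.
// rllama's real kernels are more elaborate; this just shows why
// +avx2 and +fma must be enabled at compile time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, fused
    }
    // Horizontal sum of the 8 lanes, then the scalar tail.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```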
@@ -46,55 +50,60 @@ RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
 There is a `.cargo/config.toml` inside this repository that will enable these
 features if you install manually from this Git repository instead.
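Such a `.cargo/config.toml` typically looks like the sketch below; the file actually shipped in the repository may differ:

```toml
# Hypothetical example; check the repository's own .cargo/config.toml.
[build]
rustflags = ["-C", "target-feature=+sse2,+avx,+fma,+avx2"]
```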
-# How to run
+## LLaMA weights

-You will need Rust. Make sure you can run `cargo` from a command line. In
-particular, this is using unstable features so you need nightly rust. Make sure
-that if you write `cargo --version` it shows that it is nightly Rust.
-
-You will need to download LLaMA-7B weights. Refer to https://github.com/facebookresearch/llama/
+Refer to https://github.com/facebookresearch/llama/. As of now, you need to be
+approved to get weights.

-Once you have 7B weights, and the `tokenizer.model` it comes with, you need to
-decompress it.
+For LLaMA-7B, make sure you got these files:
+
+```shell
+* 7B/consolidated.00.pth
+* 7B/params.json
+* tokenizer.model
+```
+
+The `consolidated.00.pth` is actually a zip file. You need to unzip it:
 ```shell
-$ cd LLaMA
 $ cd 7B
 $ unzip consolidated.00.pth
-# For LLaMA-7B, rename consolidated to consolidated.00
-# For the larger models, the number is there already so no need to do this step.
 $ mv consolidated consolidated.00
 ```
-You should then be ready to generate some text.
-
-```shell
-cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
-```
+If you are using a larger model like LLaMA-13B, then you can skip the last step
+of renaming the `consolidated` directory.
+
+You should now be ready to generate some text.
-By default, it will use the weights in the precision they are in the source
-files. You can use the `--f16` command line argument to cast the largest weight
-matrices to float16. Also, using OpenCL will cast the weight matrices to
-float16.
-
-You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler
-settings.
-
-There is a `--repetition-penalty` setting. 1.0 means no penalty. This value
-likely should be between 0 and 1. Values smaller than 1.0 give a penalty to
-tokens that appear in the context, by
-`x*(repetition_penalty^num_occurrences)` before applying `softmax()` on the
-output probabilities. In other words, values smaller than 1.0 apply a penalty.
+## Example
+
+Run LLaMA-7B with some weights cast to 16-bit floats:
+
+```shell
+rllama --tokenizer-model /path/to/tokenizer.model \
+    --model-path /path/to/LLaMA/7B \
+    --param-path /path/to/LLaMA/7B/params.json \
+    --f16 \
+    --prompt "The meaning of life is"
+```
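To make the sampler settings described in the removed paragraphs above concrete, here is a minimal illustrative sketch of how knobs like `--temperature`, `--top-k`, `--top-p` and the `x*(repetition_penalty^num_occurrences)` penalty typically act on a model's raw outputs. The names and structure are hypothetical, not rllama's actual code:

```rust
/// Scale each output x by repetition_penalty^num_occurrences, where
/// num_occurrences counts how often that token appears in the context.
fn apply_repetition_penalty(outputs: &mut [f32], context: &[usize], penalty: f32) {
    for (token_id, x) in outputs.iter_mut().enumerate() {
        let n = context.iter().filter(|&&t| t == token_id).count();
        if n > 0 {
            *x *= penalty.powi(n as i32); // penalty < 1.0 suppresses repeats
        }
    }
}

/// Softmax with temperature, then keep only the top-k tokens and the
/// smallest prefix whose cumulative probability reaches top-p.
fn sample(outputs: &[f32], temperature: f32, top_k: usize, top_p: f32) -> usize {
    // Softmax with temperature (higher temperature flattens the distribution).
    let max = outputs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = outputs.iter().map(|&x| ((x - max) / temperature).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Sort token ids by probability, descending.
    let mut ids: Vec<usize> = (0..outputs.len()).collect();
    ids.sort_by(|&a, &b| exps[b].partial_cmp(&exps[a]).unwrap());

    // Truncate to top-k, then to the top-p nucleus.
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for &id in ids.iter().take(top_k) {
        kept.push(id);
        cumulative += exps[id] / total;
        if cumulative >= top_p {
            break;
        }
    }

    // A real sampler would draw randomly from `kept`, weighted by
    // probability; taking the head keeps this sketch dependency-free.
    kept[0]
}
```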
-You can also use `--prompt-file` to read the prompt from a file instead of
-the command line.
+Use `rllama --help` to see all the options.
-# How to turn on OpenCL
+## How to turn on OpenCL

 Use the `opencl` Cargo feature.

-```
-cargo run --release --features opencl -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
-```
+```
+RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl
+```
+
+```
+rllama --tokenizer-model /path/to/tokenizer.model \
+    --model-path /path/to/LLaMA/7B \
+    --param-path /path/to/LLaMA/7B/params.json \
+    --opencl-device 0 \
+    --prompt "The meaning of life is"
+```
 With the `opencl` feature, there is also another argument, `--opencl-device`, that
@@ -102,11 +111,9 @@ takes a number. That number selects Nth OpenCL device found on the system. You
 can see the devices in the output when you run the program (e.g. see the
 screenshot below).

-# Screenshot
-
-![Screenshot of RLLaMA in action](rllama.png)
+Weights are always cast to 16-bit floats for OpenCL.
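As an illustrative aside on what casting weights to 16-bit floats means, here is a minimal sketch using the `half` crate; rllama's internal storage format may well differ:

```rust
// Hypothetical sketch of f32 -> f16 weight casting with the `half`
// crate; not rllama's actual code.
use half::f16;

fn cast_weights_to_f16(weights: &[f32]) -> Vec<f16> {
    weights.iter().map(|&w| f16::from_f32(w)).collect()
}

fn main() {
    let weights = vec![0.1f32, -1.5, 3.25];
    for (w, h) in weights.iter().zip(cast_weights_to_f16(&weights)) {
        // Round-tripping shows the precision lost by the cast.
        println!("{w} -> {}", h.to_f32());
    }
}
```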
-# Notes and future plans
+## Notes and future plans

 This is a hobby thing for me so don't expect updates or help.
@@ -126,7 +133,7 @@ This is a hobby thing for me so don't expect updates or help.
 * Stanford released some instruct-finetuned LLaMA-7B, once I find the weights
   then I'd like to try make a chat-like command-line interface.
-# Benchmarks
+## Benchmarks
 I'm trying to track that I'm making this faster and not slower.

rllama.gif: binary file not shown (added, 843 KiB).

rllama.png: binary file not shown (removed, 484 KiB).