There is a `.cargo/config.toml` inside this repository that will enable these
features if you install manually from this Git repository instead.
# How to run
## LLaMA weights
You will need Rust. Make sure you can run `cargo` from the command line. In
particular, this project uses unstable features, so you need nightly Rust. Make
sure that `cargo --version` reports a nightly toolchain.
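For example, if you manage toolchains with rustup:

```shell
# Install nightly Rust and use it inside this repository's checkout.
rustup toolchain install nightly
rustup override set nightly
cargo --version   # should report a nightly version
```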
You will need to download the LLaMA-7B weights. Refer to
https://github.com/facebookresearch/llama/. As of now, you need to be approved
to get the weights. For LLaMA-7B, make sure you have these files:
* 7B/consolidated.00.pth
* 7B/params.json
* tokenizer.model
Once you have the 7B weights and the `tokenizer.model` that comes with them,
you need to decompress the weights. The `consolidated.00.pth` file is actually
a zip archive; unzip it:
```shell
$ cd LLaMA
$ cd 7B
$ unzip consolidated.00.pth
# For LLaMA-7B, rename consolidated to consolidated.00
# For the larger models, the number is there already so no need to do this step.
$ mv consolidated consolidated.00
```
If you are using a larger model like LLaMA-13B, the number is already in the
directory name, so you can skip the last renaming step. You should then be
ready to generate some text:
```shell
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
```
By default, it will use the weights in whatever precision they are in the
source files. You can use the `--f16` command line argument to cast the largest
weight matrices to float16. Using OpenCL also casts the weight matrices to
float16.
## Example
You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler
settings.
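For example, with illustrative sampler values (assuming the usual
`--flag value` syntax):

```shell
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --temperature 0.8 --top-p 0.95 --top-k 40 --prompt "The meaning of life is"
```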
Run LLaMA-7B with some of the weights cast to 16-bit floats:
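```shell
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --f16 --prompt "The meaning of life is"
```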
There is a `--repetition-penalty` setting. 1.0 means no penalty, and the value
should likely be between 0 and 1. Values smaller than 1.0 penalize tokens that
already appear in the context: each output value `x` becomes
`x * (repetition_penalty ^ num_occurrences)` before `softmax()` is applied to
the output probabilities.
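As a rough sketch of that formula (illustrative only, not rllama's actual
code):

```rust
/// Illustrative only: apply the repetition penalty described above to the
/// raw output values, given how often each token occurs in the context.
fn apply_repetition_penalty(outputs: &mut [f32], occurrences: &[u32], penalty: f32) {
    for (x, &count) in outputs.iter_mut().zip(occurrences.iter()) {
        // x becomes x * penalty^num_occurrences. With penalty = 1.0 this is
        // a no-op; values below 1.0 shrink tokens already seen in the context.
        *x *= penalty.powi(count as i32);
    }
}
```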
You can also use `--prompt-file` to read the prompt from a file instead of
from the command line.
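For example, assuming your prompt is saved in a file named `prompt.txt`:

```shell
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt-file prompt.txt
```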
Use `rllama --help` to see all the options.
# How to turn on OpenCL
Use the `opencl` Cargo feature.
```shell
cargo run --release --features opencl -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
```