@@ -1,18 +1,23 @@
use crate::embedding::Embedding;
use crate::semaphore::Semaphore;
#[cfg(feature = "opencl")]
use crate::tensor_opencl_support::OpenCL;
use crate::token_sampler::TokenSampler;
use crate::tokenizer::{TokenId, Tokenizer};
use crate::transformer::{DataSettings, Transformer};
use crate::transformer::{DataSettings, Transformer, TransformerCaches};
use crate::unpickler;
use crate::unpickler::Value;
use clap::Parser;
use colored::Colorize;
#[cfg(feature = "server")]
use rocket::{response::status, response::Stream, Data, State};
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;
use std::io::{Read, Write};
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
#[derive(Parser)]
#[derive(Parser, Clone)]
#[command(author, version, about, long_about = None)]
struct Cli {
    #[arg(long)]
@@ -22,6 +27,9 @@ struct Cli {
    #[arg(long)]
    param_path: String,
    #[arg(short, long, action)]
    quiet: bool,
    #[arg(long)]
    prompt: Option<String>,
    #[arg(long)]
@@ -36,10 +44,39 @@ struct Cli {
    top_p: Option<f32>,
    #[arg(long)]
    top_k: Option<i32>,
    #[arg(long)]
    repetition_penalty: Option<f32>,
    #[arg(long)]
    max_threads: Option<usize>,
    #[arg(long, action)]
    f16: bool,
    #[cfg(feature = "opencl")]
    #[arg(long)]
    opencl_device: Option<usize>,
    #[arg(long, action)]
    inference_server: bool,
    #[arg(long)]
    inference_server_port: Option<u16>,
    #[arg(long)]
    inference_server_host: Option<String>,
    #[arg(long)]
    inference_server_max_concurrent_inferences: Option<usize>,
    #[arg(long)]
    inference_server_api_path: Option<String>,
    #[arg(long)]
    inference_server_prompt_cache_size: Option<usize>,
    #[arg(long, action)]
    inference_server_exit_after_one_query: bool,
}
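// Illustrative invocations only (the binary name and file paths below are placeholders;
// clap derives the kebab-case flag names from the fields above):
//
//   rllama --model-path /path/to/model.pth \
//          --tokenizer-path /path/to/tokenizer.model \
//          --param-path /path/to/params.json \
//          --prompt "Hello world" \
//          --temperature 0.8 --top-k 40 --repetition-penalty 1.1
//
//   rllama --model-path ... --tokenizer-path ... --param-path ... \
//          --inference-server --inference-server-port 8080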
#[derive(Clone, Serialize, Deserialize)]
@@ -54,14 +91,46 @@ struct ModelParams {
pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let model_path = cli.model_path;
    let tokenizer_path = cli.tokenizer_path;
    let param_path = cli.param_path;
    let model_path = cli.model_path.clone();
    let tokenizer_path = cli.tokenizer_path.clone();
    let param_path = cli.param_path.clone();
    #[cfg(not(feature = "server"))]
    if cli.inference_server {
        eprintln!("Inference server is not enabled in this build.");
        return Err("Inference server is not enabled in this build.".into());
    }
    let max_threads: usize = match cli.max_threads {
        None => rayon::current_num_threads(),
        Some(max_threads) => {
            rayon::ThreadPoolBuilder::new()
                .num_threads(max_threads)
                .build_global()
                .unwrap();
            max_threads
        }
    };
    let mut be_quiet: bool = false;
    if !colored::control::SHOULD_COLORIZE.should_colorize() {
        be_quiet = true;
    }
    if cli.quiet {
        be_quiet = true;
    }
    if be_quiet {
        colored::control::SHOULD_COLORIZE.set_override(false);
    }
    // Custom println-like macro that respects be_quiet
    macro_rules! pln {
        ($($arg:tt)*) => {
            if !be_quiet {
                std::println!($($arg)*);
            }
        };
    }
    #[cfg(feature = "opencl")]
    let opencl: Option<OpenCL> = {
@@ -79,15 +148,6 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
        }
    };
    // Custom println-like macro that respects be_quiet
    macro_rules! pln {
        ($($arg:tt)*) => {
            if !be_quiet {
                std::println!($($arg)*);
            }
        };
    }
    // Read ModelParams from param_path, we expect it to be JSON
    let mut fs = std::fs::File::open(&param_path)?;
    let mut bs = Vec::new();
@@ -96,12 +156,12 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    let params: ModelParams = serde_json::from_slice(&bs)?;
    pln!("Loaded model parameters from {}.", param_path);
    let prompt: String = match (cli.prompt, cli.prompt_file) {
        (Some(prompt), None) => {
    let prompt: String = match (&cli.prompt, &cli.prompt_file) {
        (Some(ref prompt), None) => {
            pln!("Using prompt: {}", prompt);
            prompt
            prompt.clone()
        }
        (None, Some(prompt_file)) => {
        (None, Some(ref prompt_file)) => {
            pln!("Using prompt file: {}", prompt_file);
            let mut fs = std::fs::File::open(prompt_file)?;
            let mut bs = Vec::new();
@@ -110,8 +170,12 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
            String::from_utf8(bs)?
        }
        _ => {
            eprintln!("Please provide either a prompt or a prompt file.");
            return Err("Please provide either a prompt or a prompt file.".into());
            if cli.inference_server {
                "".to_string()
            } else {
                eprintln!("Please provide either a prompt or a prompt file.");
                return Err("Please provide either a prompt or a prompt file.".into());
            }
        }
    };
@@ -150,12 +214,9 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    pln!("Loading embeddings from {}...", model_path);
    let emb = Embedding::from_unpickled(&unpickle_results, model_path.clone())?;
    let max_seq_len = match cli.max_seq_len {
        Some(max_seq_len) => max_seq_len,
        None => 1024,
    };
    let max_seq_len = cli.max_seq_len.unwrap_or(1024);
    let data_settings = {
    let mut data_settings = {
        #[cfg(feature = "opencl")]
        {
            if let Some(opencl) = opencl {
@@ -169,6 +230,10 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
            DataSettings::new()
    };
    if cli.f16 {
        data_settings = data_settings.force_f16();
    }
    pln!("Loading transformer weights from {}...", model_path);
    let tr = Transformer::from_unpickled(
        &unpickle_results,
@@ -183,9 +248,455 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    )?;
    pln!("All is loaded. Starting inference.");
    let tr: Arc<Transformer> = Arc::new(tr);
    let tok: Arc<Tokenizer> = Arc::new(tok);
    if cli.inference_server {
        #[cfg(feature = "server")]
        {
            server_inference(cli, tr, tok, be_quiet, max_seq_len, params, max_threads)
        }
        #[cfg(not(feature = "server"))]
        {
            eprintln!("The inference server feature is not enabled.");
            eprintln!("Please enable it with the \"inference-server\" feature.");
            Err("The inference server feature is not enabled.".into())
        }
    } else {
        command_line_inference(
            cli.clone(),
            tr.clone(),
            tok.clone(),
            prompt.clone(),
            be_quiet,
            max_seq_len,
            params.clone(),
            max_threads,
        )
    }
}
#[cfg(feature = "server")]
fn server_inference(
    cli: Cli,
    tr: Arc<Transformer>,
    tok: Arc<Tokenizer>,
    be_quiet: bool,
    max_seq_len: usize,
    _params: ModelParams,
    _max_threads: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    macro_rules! pln {
        ($($arg:tt)*) => {
            if !be_quiet {
                std::println!($($arg)*);
            }
        };
    }
    let inference_server_port = cli.inference_server_port.unwrap_or(8080);
    let inference_server_host = cli
        .inference_server_host
        .clone()
        .unwrap_or("127.0.0.1".to_string());
    let inference_server_max_concurrent_inferences =
        cli.inference_server_max_concurrent_inferences.unwrap_or(5);
    let inference_server_api_path = cli
        .inference_server_api_path
        .clone()
        .unwrap_or("/rllama/v1/inference".to_string());
    let inference_server_prompt_cache_size = cli.inference_server_prompt_cache_size.unwrap_or(50);
    pln!(
        "Maximum concurrent inferences: {}",
        inference_server_max_concurrent_inferences
    );
    pln!("Prompt cache size: {}", inference_server_prompt_cache_size);
    pln!("Maximum sequence length: {}", max_seq_len);
    pln!(
        "--- Starting HTTP server on {}:{}, answering to requests at {} ---",
        inference_server_host,
        inference_server_port,
        inference_server_api_path
    );
    // If there are too many connections, they will hang until they get their turn.
    // Maybe we can later return 503 Slow Down or something similar.
    let concurrent_requests_semaphore = Semaphore::new(inference_server_max_concurrent_inferences);
    let rocket_conf = rocket::Config::build(rocket::config::Environment::Production)
        .address(inference_server_host)
        .port(inference_server_port)
        .finalize()
        .unwrap();
    let app = rocket::custom(rocket_conf)
        .mount(&inference_server_api_path, routes![handle_request])
        .manage(InferenceServerState {
            transformer: tr,
            tokenizer: tok,
            max_seq_len,
            concurrent_requests_semaphore,
            attention_cache_repository: Arc::new(RwLock::new(AttentionCacheRepository::empty(
                inference_server_prompt_cache_size,
            ))),
            exit_after_one_query: cli.inference_server_exit_after_one_query,
        });
    app.launch();
    panic!("Starting web server failed.");
}
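// Illustrative request against the server above, assuming the default host, port and
// API path ("127.0.0.1", 8080, "/rllama/v1/inference"); the JSON fields come from
// InferenceRequest below:
//
//   curl -X POST -d '{"prompt": "Hello", "max_new_tokens": 32}' \
//        http://127.0.0.1:8080/rllama/v1/inference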
fn is_false(b: &bool) -> bool {
    !b
}
#[derive(Serialize, Deserialize, Clone, Debug)]
struct InferenceRequest {
    temperature: Option<f32>,
    top_k: Option<usize>,
    top_p: Option<f32>,
    repetition_penalty: Option<f32>,
    max_seq_len: Option<usize>,
    max_new_tokens: Option<usize>,
    no_token_sampling: Option<bool>,
    stop_at_end_token: Option<bool>,
    prompt: String,
}
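// Example request body (illustrative values; every field except "prompt" is optional
// and falls back to the defaults chosen in handle_request):
//
//   {
//     "prompt": "Once upon a time",
//     "temperature": 0.8,
//     "top_k": 40,
//     "top_p": 0.9,
//     "repetition_penalty": 1.1,
//     "max_new_tokens": 64,
//     "stop_at_end_token": true
//   }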
#[cfg(feature = "server")]
#[derive(Serialize, Deserialize, Clone, Debug)]
struct PredResult {
    p: f32,
    #[serde(skip_serializing_if = "is_false")]
    is_end_token: bool,
}
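// Each generated token is streamed back as one JSON line mapping the token string to a
// PredResult, e.g. (illustrative values):
//
//   {"Hello":{"p":0.83}}
//   {"</s>":{"p":0.95,"is_end_token":true}}
//
// "is_end_token" is omitted when false because of the skip_serializing_if above.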
#[cfg(feature = "server")]
struct GeneratingSession {
    transformer: Arc<Transformer>,
    token_sampler: TokenSampler,
    tokenizer: Arc<Tokenizer>,
    attention_cache_repository: Arc<RwLock<AttentionCacheRepository>>,
    tokens: Vec<TokenId>,
    req_max_seq_len: usize,
    req_max_new_tokens: usize,
    new_tokens_generated: usize,
    prev_pos: usize,
    no_token_sampling: bool,
    stop_at_end_token: bool,
    sent_stuff_last_time: bool,
    exit_after_one_query: bool,
    result: Vec<u8>, // stores JSONL lines to be returned from read()
}
#[cfg(feature = "server")]
impl GeneratingSession {
    /// Drains as many buffered result bytes as fit into `buf`; returns the number of bytes copied.
    fn read_from_result(&mut self, buf: &mut [u8]) -> usize {
        if !self.result.is_empty() {
            if self.result.len() <= buf.len() {
                // Everything that is buffered fits: copy it all and clear the buffer.
                buf[..self.result.len()].copy_from_slice(&self.result);
                let len = self.result.len();
                self.sent_stuff_last_time = true;
                self.result.truncate(0);
                return len;
            } else {
                // Only a prefix fits: copy it and keep the rest for the next read.
                buf.copy_from_slice(&self.result[..buf.len()]);
                self.result = self.result[buf.len()..].to_vec();
                self.sent_stuff_last_time = true;
                return buf.len();
            }
        }
        0
    }
}
#[cfg(feature = "server")]
impl Read for GeneratingSession {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        if self.sent_stuff_last_time && self.result.is_empty() {
            // If we return WouldBlock every time we send something, it'll cause Rocket to
            // flush available data.
            self.sent_stuff_last_time = false;
            return Err(std::io::Error::new(
                std::io::ErrorKind::WouldBlock,
                "WouldBlock",
            ));
        }
        // Push more data to the upstream if we have something stored.
        let bytes_read = self.read_from_result(buf);
        if bytes_read > 0 {
            return Ok(bytes_read);
        }
        if self.tokens.len() >= self.req_max_seq_len {
            if self.exit_after_one_query {
                std::process::exit(0);
            }
            return Ok(0);
        }
        if self.new_tokens_generated >= self.req_max_new_tokens {
            if self.exit_after_one_query {
                std::process::exit(0);
            }
            return Ok(0);
        }
        // Try to reuse a cached attention state for this token prefix:
        //  - hit at or beyond prev_pos: clone it and continue from that position,
        //  - stale hit (behind prev_pos): start from fresh caches at position 0,
        //  - miss: make fresh caches and register them for this prefix.
        let (mut caches, update_pos) = {
            let mut ac = self.attention_cache_repository.write().unwrap();
            match ac.get(&self.tokens) {
                Some((c, pos)) if pos >= self.prev_pos => (c.true_clone(), pos),
                Some(_) => {
                    std::mem::drop(ac);
                    (self.transformer.make_caches(), 0)
                }
                None => {
                    let caches = self.transformer.make_caches();
                    ac.put(self.tokens.clone(), caches.true_clone(), self.prev_pos);
                    (caches, self.prev_pos)
                }
            }
        };
        if update_pos > self.prev_pos {
            self.prev_pos = update_pos;
        }
        assert!(self.result.is_empty());
        let predictions =
            self.transformer
                .forward(&self.tokens[self.prev_pos..], self.prev_pos, &mut caches);
        self.prev_pos = self.tokens.len();
        let (highest_pred_idx, token_prob) =
            self.token_sampler
                .sample(&predictions, self.tokenizer.as_ref(), &self.tokens);
        self.tokens.push(highest_pred_idx as TokenId);
        {
            let mut ac = self.attention_cache_repository.write().unwrap();
            ac.put(self.tokens.clone(), caches, self.prev_pos);
        }
        self.new_tokens_generated += 1;
        let token: &str = self.tokenizer.id_to_str(highest_pred_idx as TokenId);
        let mut is_end_token: bool = false;
        if token == "</s>" && self.stop_at_end_token {
            self.new_tokens_generated = self.req_max_new_tokens;
            is_end_token = true;
        }
        let mut result: BTreeMap<String, PredResult> = BTreeMap::new();
        if self.no_token_sampling {
            // All predictions go on the line.
            let probs = self
                .token_sampler
                .logits_to_btreemap(&predictions, self.tokenizer.as_ref());
            for (k, v) in probs.into_iter() {
                let mut is_end_token: bool = false;
                if k == "</s>" {
                    is_end_token = true;
                }
                result.insert(
                    k,
                    PredResult {
                        p: v,
                        is_end_token,
                    },
                );
            }
            // Convert to JSON
            let json = serde_json::to_string(&result).unwrap();
            self.result.extend(json.as_bytes());
            self.result.push(b'\n');
            return Ok(self.read_from_result(buf));
        } else {
            result.insert(
                token.to_string(),
                PredResult {
                    p: token_prob,
                    is_end_token,
                },
            );
            let json = serde_json::to_string(&result).unwrap();
            self.result.extend(json.as_bytes());
            self.result.push(b'\n');
            return Ok(self.read_from_result(buf));
        }
    }
}
#[cfg(feature = "server")]
struct AttentionCacheRepository {
    caches: BTreeMap<Vec<TokenId>, (TransformerCaches, usize, std::time::Instant)>,
    max_sz: usize,
}
#[cfg(feature = "server")]
impl AttentionCacheRepository {
    fn empty(max_size: usize) -> AttentionCacheRepository {
        AttentionCacheRepository {
            caches: BTreeMap::new(),
            max_sz: max_size,
        }
    }
    /// Makes sure the cache repository is not larger than sz; evicts the oldest items first.
    fn limit_size(&mut self, sz: usize) {
        if sz == 0 {
            self.caches = BTreeMap::new();
            return;
        }
        // Slow algorithm, but the cache will never be unimaginably large, so it's probably fine.
        while self.caches.len() > sz {
            let mut oldest_time = None;
            let mut oldest_key: Option<&Vec<TokenId>> = None;
            for (k, (_, _, time)) in self.caches.iter() {
                if oldest_time.is_none() || time < oldest_time.unwrap() {
                    oldest_time = Some(time);
                    oldest_key = Some(k);
                }
            }
            let oldest_key = oldest_key.unwrap().clone();
            self.caches.remove(&oldest_key);
        }
    }
    fn get(&self, tokens: &[TokenId]) -> Option<(&TransformerCaches, usize)> {
        if let Some((caches, pos, _)) = self.caches.get(tokens) {
            Some((caches, *pos))
        } else {
            None
        }
    }
    fn put(&mut self, tokens: Vec<TokenId>, caches: TransformerCaches, prev_pos: usize) {
        self.caches
            .insert(tokens, (caches, prev_pos, std::time::Instant::now()));
        self.limit_size(self.max_sz);
    }
}
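// Sketch of the eviction behaviour (illustrative): with max_sz = 2, put(A, ...),
// put(B, ...), put(C, ...) leaves only the two most recently inserted prefixes,
// because put() timestamps each entry with Instant::now() and limit_size() keeps
// removing the entry with the oldest timestamp until len() <= max_sz.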
#[cfg(feature = "server")]
#[derive(Clone)]
struct InferenceServerState {
    transformer: Arc<Transformer>,
    tokenizer: Arc<Tokenizer>,
    max_seq_len: usize,
    concurrent_requests_semaphore: Semaphore,
    attention_cache_repository: Arc<RwLock<AttentionCacheRepository>>,
    exit_after_one_query: bool,
}
#[cfg(feature = "server")]
#[post("/", data = "<input>")]
fn handle_request(
    state: State<InferenceServerState>,
    input: Data,
) -> Result<Stream<GeneratingSession>, status::BadRequest<String>> {
    let _lock = state.concurrent_requests_semaphore.acquire();
    let tr = state.transformer.clone();
    let tok = state.tokenizer.clone();
    let mut data = input.open();
    let mut databuf: Vec<u8> = Vec::new();
    data.read_to_end(&mut databuf).unwrap();
    // Parse the JSON out of the request
    let request: InferenceRequest = match serde_json::from_slice(&databuf) {
        Err(_e) => {
            return Err(status::BadRequest(Some("Invalid JSON.".to_string())));
        }
        Ok(ir) => ir,
    };
    let stop_at_end_token = request.stop_at_end_token.unwrap_or(true);
    let temperature = request.temperature.unwrap_or(1.0);
    let top_k = request.top_k.unwrap_or(20);
    let top_p = request.top_p.unwrap_or(1.0);
    let repetition_penalty = request.repetition_penalty.unwrap_or(1.0);
    let mut req_max_seq_len = request.max_seq_len.unwrap_or(state.max_seq_len);
    if req_max_seq_len > state.max_seq_len {
        req_max_seq_len = state.max_seq_len;
    }
    let req_max_new_tokens = request.max_new_tokens.unwrap_or(20);
    let no_token_sampling = request.no_token_sampling.unwrap_or(false);
    let prompt = request.prompt;
    if temperature.is_nan() {
        return Err(status::BadRequest(Some(
            "Temperature must be a number.".to_string(),
        )));
    }
    if top_k == 0 {
        return Err(status::BadRequest(Some(
            "Top-k must be greater than 0.".to_string(),
        )));
    }
    if top_p.is_nan() {
        return Err(status::BadRequest(Some(
            "Top-p must be a number.".to_string(),
        )));
    }
    if repetition_penalty.is_nan() {
        return Err(status::BadRequest(Some(
            "Repetition penalty must be a number.".to_string(),
        )));
    }
    let token_sampler = TokenSampler::new()
        .temperature(temperature)
        .top_p(top_p)
        .top_k(top_k)
        .repetition_penalty(repetition_penalty);
    let toks_id: Vec<TokenId> = tok.tokenize_to_ids(prompt.clone());
    let gsession = GeneratingSession {
        transformer: tr,
        tokenizer: tok,
        attention_cache_repository: state.attention_cache_repository.clone(),
        token_sampler,
        tokens: toks_id,
        req_max_seq_len,
        req_max_new_tokens,
        new_tokens_generated: 0,
        prev_pos: 0,
        no_token_sampling,
        stop_at_end_token,
        sent_stuff_last_time: false,
        exit_after_one_query: state.exit_after_one_query,
        result: Vec::new(),
    };
    Ok(rocket::response::Stream::chunked(gsession, 1024))
}
fn command_line_inference(
    cli: Cli,
    tr: Arc<Transformer>,
    tok: Arc<Tokenizer>,
    prompt: String,
    be_quiet: bool,
    max_seq_len: usize,
    params: ModelParams,
    max_threads: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    // Custom println-like macro that respects be_quiet
    macro_rules! pln {
        ($($arg:tt)*) => {
            if !be_quiet {
                std::println!($($arg)*);
            }
        };
    }
    let mut toks_id: Vec<TokenId> = tok.tokenize_to_ids(prompt.clone());
    let mut prev_pos = 0;
    let mut token_sampler = TokenSampler::new().temperature(0.8).top_p(0.9).top_k(50);
    let mut token_sampler = TokenSampler::new()
        .temperature(1.0)
        .top_p(1.0)
        .top_k(20)
        .repetition_penalty(1.0);
    if let Some(temperature) = cli.temperature {
        token_sampler = token_sampler.temperature(temperature);
@@ -196,6 +707,9 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    if let Some(top_k) = cli.top_k {
        token_sampler = token_sampler.top_k(top_k as usize);
    }
    if let Some(repetition_penalty) = cli.repetition_penalty {
        token_sampler = token_sampler.repetition_penalty(repetition_penalty);
    }
    pln!("---");
    pln!(" dim: {}", params.dim);
@@ -205,10 +719,16 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    pln!(" norm_eps: {}", params.norm_eps);
    pln!(" vocab_size: {}", params.vocab_size);
    pln!("---");
    pln!(" maximum number of threads: {}", max_threads);
    pln!("---");
    pln!("Max sequence length: {}", max_seq_len);
    pln!("Temperature: {}", token_sampler.get_temperature());
    pln!("Top P: {}", token_sampler.get_top_p());
    pln!("Top K: {}", token_sampler.get_top_k());
    pln!(
        "Repetition penalty: {}",
        token_sampler.get_repetition_penalty()
    );
    pln!("---");
    pln!(
        "{}",
@@ -226,11 +746,12 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut times_per_token: Vec<std::time::Duration> = vec![];
    let mut caches = tr.make_caches();
    let mut first: bool = true;
    let mut stop_seen: bool = false;
    while toks_id.len() < max_seq_len {
        let now = std::time::Instant::now();
        let preds = tr.forward(&toks_id[prev_pos..], prev_pos, &mut caches);
        let highest_pred_idx = token_sampler.sample(&preds);
        let (highest_pred_idx, token_prob) = token_sampler.sample(&preds, &tok, &toks_id);
        toks_id.push(highest_pred_idx as TokenId);
        for (tok_idx, tok_id) in toks_id[prev_pos + 1..].iter().enumerate() {
@@ -239,6 +760,10 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
            }
            let mut tok_str: String = "".to_string();
            let tok = tok.id_to_str(*tok_id);
            if tok == "</s>" {
                tok_str += "";
                stop_seen = true;
            }
            if tok == "<0x0A>" {
                tok_str += "\n";
            } else {
@@ -247,7 +772,18 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
            if first && tok_idx < toks_id.len() - 2 {
                // intentionally left empty
            } else {
                print!("{}", tok_str.truecolor(128, 255, 128));
                // Shade the token by its sampled probability: the higher token_prob is,
                // the redder (and less green) the printed token becomes.
                let redness: f32 = token_prob * 255.0;
                let redness = if redness > 255.0 {
                    255
                } else if redness < 0.0 {
                    0
                } else {
                    redness as u8
                };
                print!(
                    "{}",
                    tok_str.truecolor(128 + redness / 2, 255 - redness / 2, 128)
                );
            }
        }
@@ -258,8 +794,14 @@ pub fn main() -> Result<(), Box<dyn std::error::Error>> {
        let _ = std::io::stdout().flush();
        prev_pos = toks_id.len() - 1;
        first = false;
        if stop_seen {
            break;
        }
    }
    println!();
    if stop_seen && !be_quiet {
        println!("Stop token seen. Stopping.");
    }
    println!("");
    if !be_quiet {
        println!("---");
        println!(