@@ -2,10 +2,7 @@
 * OpenCL stuff to run (some) of the tensor operations.
 */
use ocl::{
    enums::DeviceInfo, enums::DeviceInfoResult, Buffer, Context, Device, DeviceType, Event, Kernel,
    Platform, Program, Queue,
};
use ocl::{Buffer, Context, Device, Event, Kernel, Platform, Program, Queue};
use std::alloc::Layout;
use std::sync::{Arc, RwLock};
use thiserror::Error;
@@ -13,16 +10,28 @@ use thiserror::Error;
#[derive(Debug)]
#[allow(dead_code)]
struct Programs {
    matrix_mul_transposed_f16_program: Program,
    matrix_mul_transposed_f16: Kernel,
    matrix_mul_transposed_f16_cpu_optimized_program: Program,
    matrix_mul_transposed_f16_cpu_optimized: Kernel,
    matrix_mul_transposed_by_row_f16_program: Program,
    matrix_mul_transposed_by_row_f16: Kernel,
    silu_f16_program: Program,
    silu_f16: Kernel,
    hadamard_product_f16_program: Program,
    hadamard_product_f16: Kernel,
    transpose_f16_program: Program,
    transpose_f16: Kernel,
    pow_f16_program: Program,
    pow_f16: Kernel,
    mean_cols_f16_program: Program,
    mean_cols_f16: Kernel,
    add_scalar_f16_program: Program,
    add_scalar_f16: Kernel,
    scalar_multiply_broadcast_f16_program: Program,
    scalar_multiply_broadcast_f16: Kernel,
    hadamard_product_broadcast_f16_program: Program,
    hadamard_product_broadcast_f16: Kernel,
    rsqrt_f16_program: Program,
    rsqrt_f16: Kernel,
    add_f16_program: Program,
    add_f16: Kernel,
}
#[derive(Debug, Clone)]
@@ -31,7 +40,6 @@ pub struct OpenCL {
    ctx: Context,
    queue: Queue,
    programs: Arc<RwLock<Programs>>,
    is_cpu_device: bool,
}
#[derive(Debug)]
@@ -110,18 +118,12 @@ impl OpenCL {
            .devices(devices[nth_device].1)
            .build()?;
        let is_cpu_device = match devices[nth_device].1.info(DeviceInfo::Type)? {
            DeviceInfoResult::Type(DeviceType::CPU) => true,
            _ => false,
        };
        let queue = Queue::new(&ctx, devices[nth_device].1, None)?;
        let programs = make_programs(&ctx, &queue)?;
        Ok(OpenCL {
            ctx: ctx,
            queue: queue,
            programs: Arc::new(RwLock::new(programs)),
            is_cpu_device,
        })
    }
@@ -229,6 +231,58 @@ impl OpenCLTensor {
        Ok(OpenCLEvent { event })
    }
    pub fn add_scalar_inplace(&mut self, scalar: f32) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.add_scalar_f16.set_arg(0, self.buf.clone()).unwrap();
        prg.add_scalar_f16
            .set_arg(1, self.cols_capacity as i32)
            .unwrap();
        prg.add_scalar_f16.set_arg(2, scalar).unwrap();
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .add_scalar_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, self.cols as usize])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn scalar_multiply_broadcast_inplace(
        &mut self,
        other: &OpenCLTensor,
    ) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.scalar_multiply_broadcast_f16
            .set_arg(0, self.buf.clone())
            .unwrap();
        prg.scalar_multiply_broadcast_f16
            .set_arg(1, other.buf.clone())
            .unwrap();
        prg.scalar_multiply_broadcast_f16
            .set_arg(2, self.cols_capacity as i32)
            .unwrap();
        prg.scalar_multiply_broadcast_f16
            .set_arg(3, other.cols_capacity as i32)
            .unwrap();
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .scalar_multiply_broadcast_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, (self.cols_capacity / 16) as usize])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn transpose_from(&mut self, other: &OpenCLTensor) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.transpose_f16.set_arg(0, self.buf.clone()).unwrap();
@@ -247,7 +301,7 @@ impl OpenCLTensor {
                .queue(&self.queue)
                .global_work_size([self.rows as usize, self.cols as usize])
                .enew(&mut event);
            b.enq().unwrap();
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
@@ -278,6 +332,85 @@ impl OpenCLTensor {
        Ok(OpenCLEvent { event })
    }
    pub fn hadamard_product_broadcast_inplace(
        &mut self,
        other: &OpenCLTensor,
    ) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.hadamard_product_broadcast_f16
            .set_arg(0, self.buf.clone())?;
        prg.hadamard_product_broadcast_f16
            .set_arg(1, other.buf.clone())?;
        prg.hadamard_product_broadcast_f16
            .set_arg(2, self.cols_capacity as i32)?;
        prg.hadamard_product_broadcast_f16
            .set_arg(3, other.cols_capacity as i32)?;
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .hadamard_product_broadcast_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, (self.cols_capacity as usize) / 16])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn mean_cols_from(&mut self, other: &OpenCLTensor) -> Result<OpenCLEvent, OpenCLError> {
        if self.cols != 1 {
            panic!(
                "mean_cols_from: number of columns in target is not 1: {}",
                self.cols
            );
        }
        if self.rows != other.rows {
            panic!(
                "mean_cols_from: number of rows in target is not equal to number of rows in source: {} != {}",
                self.rows, other.rows
            );
        }
        let prg = self.cl.programs.write().unwrap();
        prg.mean_cols_f16.set_arg(0, self.buf.clone())?;
        prg.mean_cols_f16.set_arg(1, other.buf.clone())?;
        prg.mean_cols_f16.set_arg(2, self.cols_capacity as i32)?;
        prg.mean_cols_f16.set_arg(3, other.cols_capacity as i32)?;
        prg.mean_cols_f16.set_arg(4, other.cols as i32)?;
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .mean_cols_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, 1])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn pow_inplace(&mut self, scalar: f32) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.pow_f16.set_arg(0, self.buf.clone())?;
        prg.pow_f16.set_arg(1, self.cols_capacity as i32)?;
        prg.pow_f16.set_arg(2, scalar)?;
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .pow_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, self.cols as usize])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn silu_inplace(&mut self) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.silu_f16.set_arg(0, self.buf.clone())?;
@@ -296,6 +429,44 @@ impl OpenCLTensor {
        Ok(OpenCLEvent { event })
    }
    pub fn add_inplace(&mut self, left: &OpenCLTensor) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.add_f16.set_arg(0, self.buf.clone())?;
        prg.add_f16.set_arg(1, left.buf.clone())?;
        prg.add_f16.set_arg(2, self.cols_capacity as i32)?;
        prg.add_f16.set_arg(3, left.cols_capacity as i32)?;
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .add_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, self.cols as usize])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn rsqrt_inplace(&mut self) -> Result<OpenCLEvent, OpenCLError> {
        let prg = self.cl.programs.write().unwrap();
        prg.rsqrt_f16.set_arg(0, self.buf.clone())?;
        prg.rsqrt_f16.set_arg(1, self.cols_capacity as i32)?;
        let mut event = Event::empty();
        unsafe {
            let b = prg
                .rsqrt_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, self.cols as usize])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
    }
    pub fn matrix_mul_inplace_transposed(
        &mut self,
        src: &OpenCLTensor,
@@ -314,69 +485,38 @@ impl OpenCLTensor {
            );
        }
        // Clear out the target memory.
        // Clear out the target memory
        unsafe { self.buf.cmd().fill(0u16, None).block(false).enq()? };
        let prg = self.cl.programs.write().unwrap();
        // 0 = CPU optimized
        // 1 = GPU optimized
        // 2 = GPU optimized vector multiply (other.rows == 1)
        const CPU: u8 = 0;
        const GPU: u8 = 1;
        let strategy: u8 = if self.cl.is_cpu_device { CPU } else { GPU };
        let prg = if strategy == CPU {
            &prg.matrix_mul_transposed_f16_cpu_optimized
        } else {
            &prg.matrix_mul_transposed_f16
        };
        prg.set_arg(0, self.buf.clone())?;
        prg.set_arg(1, src.buf.clone())?;
        prg.set_arg(2, other.buf.clone())?;
        prg.set_arg(3, src.cols_capacity as i32)?;
        prg.set_arg(4, other.cols_capacity as i32)?;
        prg.set_arg(5, self.cols_capacity as i32)?;
        prg.set_arg(6, self.rows as i32)?;
        prg.set_arg(7, self.cols as i32)?;
        prg.set_arg(8, src.cols as i32)?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(0, self.buf.clone())?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(1, src.buf.clone())?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(2, other.buf.clone())?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(3, src.cols_capacity as i32)?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(4, other.cols_capacity as i32)?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(5, self.cols_capacity as i32)?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(6, self.rows as i32)?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(7, self.cols as i32)?;
        prg.matrix_mul_transposed_by_row_f16
            .set_arg(8, src.cols as i32)?;
        let mut event = Event::empty();
        let rows16 = if self.rows % 16 == 0 {
            self.rows
        } else {
            self.rows + 16 - (self.rows % 16)
        };
        let cols16 = if self.cols % 16 == 0 {
            self.cols
        } else {
            self.cols + 16 - (self.cols % 16)
        };
        unsafe {
            if strategy == CPU {
                let b = prg
                    .cmd()
                    .queue(&self.queue)
                    .global_work_size([self.cols as usize, self.rows as usize])
                    .enew(&mut event);
                b.enq()?;
            } else if strategy == GPU {
                let b = prg
                    .cmd()
                    .queue(&self.queue)
                    .global_work_size([cols16 as usize, rows16 as usize])
                    .local_work_size([16, 16])
                    .enew(&mut event);
                b.enq()?;
            } else {
                let b = prg
                    .cmd()
                    .queue(&self.queue)
                    .global_work_size([self.cols as usize, self.rows as usize])
                    .enew(&mut event);
                b.enq()?;
            }
            let b = prg
                .matrix_mul_transposed_by_row_f16
                .cmd()
                .queue(&self.queue)
                .global_work_size([self.rows as usize, self.cols as usize])
                .enew(&mut event);
            b.enq()?;
        }
        self.last_event = Some(event.clone());
        Ok(OpenCLEvent { event })
@@ -396,27 +536,11 @@ fn make_programs(ctx: &Context, queue: &Queue) -> Result<Programs, OpenCLError>
        Ok(program)
    }
    let matrix_mul_transposed_f16_program =
        make_program_with_src(ctx, MATRIX_MUL_TRANSPOSED_F16_SRC)?;
    let matrix_mul_transposed_f16 = Kernel::builder()
        .program(&matrix_mul_transposed_f16_program)
        .name("matrix_mul_transposed_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .arg(&0)
        .arg(&0)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let matrix_mul_transposed_f16_cpu_optimized_program =
        make_program_with_src(ctx, MATRIX_MUL_TRANSPOSED_F16_CPU_OPTIMIZED_SRC)?;
    let matrix_mul_transposed_f16_cpu_optimized = Kernel::builder()
        .program(&matrix_mul_transposed_f16_cpu_optimized_program)
        .name("matrix_mul_transposed_f16_cpu_optimized")
    let matrix_mul_transposed_by_row_f16_program =
        make_program_with_src(ctx, MATRIX_MUL_TRANSPOSED_BY_ROW_F16_SRC)?;
    let matrix_mul_transposed_by_row_f16 = Kernel::builder()
        .program(&matrix_mul_transposed_by_row_f16_program)
        .name("matrix_mul_transposed_by_row_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
@@ -456,71 +580,124 @@ fn make_programs(ctx: &Context, queue: &Queue) -> Result<Programs, OpenCLError>
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let pow_f16_program = make_program_with_src(ctx, POW_F16_SRC)?;
    let pow_f16 = Kernel::builder()
        .program(&pow_f16_program)
        .name("pow_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let mean_cols_f16_program = make_program_with_src(ctx, MEAN_COLS_F16_SRC)?;
    let mean_cols_f16 = Kernel::builder()
        .program(&mean_cols_f16_program)
        .name("mean_cols_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let add_scalar_f16_program = make_program_with_src(ctx, ADD_SCALAR_F16_SRC)?;
    let add_scalar_f16 = Kernel::builder()
        .program(&add_scalar_f16_program)
        .name("add_scalar_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let scalar_multiply_broadcast_f16_program =
        make_program_with_src(ctx, SCALAR_MULTIPLY_BROADCAST_F16_SRC)?;
    let scalar_multiply_broadcast_f16 = Kernel::builder()
        .program(&scalar_multiply_broadcast_f16_program)
        .name("scalar_multiply_broadcast_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let hadamard_product_broadcast_f16_program =
        make_program_with_src(ctx, HADAMARD_PRODUCT_BROADCAST_F16_SRC)?;
    let hadamard_product_broadcast_f16 = Kernel::builder()
        .program(&hadamard_product_broadcast_f16_program)
        .name("hadamard_product_broadcast_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let rsqrt_f16_program = make_program_with_src(ctx, RSQRT_F16_SRC)?;
    let rsqrt_f16 = Kernel::builder()
        .program(&rsqrt_f16_program)
        .name("rsqrt_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    let add_f16_program = make_program_with_src(ctx, ADD_F16_SRC)?;
    let add_f16 = Kernel::builder()
        .program(&add_f16_program)
        .name("add_f16")
        .arg(None::<&Buffer<u16>>)
        .arg(None::<&Buffer<u16>>)
        .arg(&0)
        .arg(&0)
        .queue(queue.clone())
        .build()?;
    Ok(Programs {
        matrix_mul_transposed_f16_program,
        matrix_mul_transposed_f16,
        matrix_mul_transposed_f16_cpu_optimized_program,
        matrix_mul_transposed_f16_cpu_optimized,
        matrix_mul_transposed_by_row_f16_program,
        matrix_mul_transposed_by_row_f16,
        silu_f16_program,
        silu_f16,
        hadamard_product_f16_program,
        hadamard_product_f16,
        transpose_f16_program,
        transpose_f16,
        pow_f16_program,
        pow_f16,
        mean_cols_f16_program,
        mean_cols_f16,
        add_scalar_f16_program,
        add_scalar_f16,
        scalar_multiply_broadcast_f16_program,
        scalar_multiply_broadcast_f16,
        hadamard_product_broadcast_f16_program,
        hadamard_product_broadcast_f16,
        rsqrt_f16_program,
        rsqrt_f16,
        add_f16_program,
        add_f16,
    })
}
const MATRIX_MUL_TRANSPOSED_F16_SRC: &str = r#"
const MATRIX_MUL_TRANSPOSED_BY_ROW_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void matrix_mul_transposed_f16(
    __global half *tgt,
    __global const half *left,
    __global const half *right,
    const int left_cols_capacity,
    const int right_cols_capacity,
    const int ncols_capacity,
    const int nrows,
    const int ncols, // size of target
    const int shared_sz
) {
    __local float lefttile[16][16];
    __local float righttile[16][16];
    const int global_x = get_global_id(0);
    const int global_y = get_global_id(1);
    const int local_x = get_local_id(0);
    const int local_y = get_local_id(1);
    const int num_tiles = (shared_sz + 15) / 16;
    float sum = 0.0f;
    for (int t = 0; t < num_tiles; ++t) {
        if (global_y < nrows) {
            lefttile[local_y][local_x] = vload_half(global_y * left_cols_capacity + t * 16 + local_x, left);
        } else {
            lefttile[local_y][local_x] = 0.0f;
        }
        if (global_x < ncols) {
            righttile[local_y][local_x] = vload_half(global_x * right_cols_capacity + t * 16 + local_y, right);
        } else {
            righttile[local_y][local_x] = 0.0f;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < 16; ++k) {
            sum += lefttile[local_y][k] * righttile[k][local_x];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (global_x < ncols && global_y < nrows) {
        vstore_half(sum, global_y * ncols_capacity + global_x, (__global half *)tgt);
    }
}
"#;
const MATRIX_MUL_TRANSPOSED_F16_CPU_OPTIMIZED_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void matrix_mul_transposed_f16_cpu_optimized(
/*
 * Matrix multiplication with a transposed second matrix, using 16-bit floats.
 *
 * One work unit per element of the target matrix.
 *
 * Assumes that each row in the matrices is zero-padded so that there's space for 32 bytes (or 16
 * halfs) of data and we don't need to care if our loops go over the bounds.
 *
 * Operations are done in float32.
 *
 * This thing is not very fast right now. I compared with PyTorch and this is like 20x slower. It
 * is still much faster than the CPU. Not sure if PyTorch uses cuBLAS, but if we could get at
 * least somewhere around 50% of that speed I would be happy.
 *
 * OpenCL on the CPU of a Ryzen 3950X seems to easily beat my own AVX2 operations.
 *
 * TODO: need to read resources like https://cnugteren.github.io/tutorial/pages/page1.html to
 * figure out how to do matrix multiplication faster.
 */
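/*
 * Illustrative example of the index arithmetic below: vload_half16 counts its offset in units of
 * 16 halfs, so with, say, left_cols_capacity = 64 and tgt_row = 2, iteration col16 = 1 computes
 * (2 * 64) / 16 + 1 = 9 and loads the 16 halfs at left[144..160), i.e. columns 16..31 of row 2.
 * This relies on left_cols_capacity being a multiple of 16, which the zero-padding assumption
 * above provides.
 */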
__kernel void matrix_mul_transposed_by_row_f16(
    __global half *tgt,
    __global const half *left,
    __global const half *right,
@@ -531,12 +708,14 @@ __kernel void matrix_mul_transposed_f16_cpu_optimized(
    const int ncols, // size of target
    const int shared_sz
) {
    const int tgt_col = get_global_id(0);
    const int tgt_row = get_global_id(1);
    int col_iterations = shared_sz / 16;
    if (shared_sz % 16 != 0) {
        col_iterations = col_iterations + 1;
    }
    const int tgt_row = get_global_id(0);
    const int tgt_col = get_global_id(1);
    float16 sum = 0;
    for (int col16 = 0; col16 < col_iterations; col16++) {
        const float16 left8 = vload_half16((tgt_row * left_cols_capacity) / 16 + col16, (__global const half *)left);
@@ -555,13 +734,17 @@ __kernel void matrix_mul_transposed_f16_cpu_optimized(
    float sum6 = sum.sa + sum.sb;
    float sum7 = sum.sc + sum.sd;
    float sum8 = sum.se + sum.sf;
    float sum11 = sum1 + sum2;
    float sum12 = sum3 + sum4;
    float sum13 = sum5 + sum6;
    float sum14 = sum7 + sum8;
    float sum21 = sum11 + sum12;
    float sum22 = sum13 + sum14;
    float total = sum21 + sum22;
    vstore_half(total, 0, (__global half *)&tgt[tgt_row * ncols_capacity + tgt_col]);
}
"#;
@@ -615,3 +798,131 @@ __kernel void transpose_f16(__global half *tgt,
    vstore_half(val, tgt_row * ncols_capacity + tgt_col, (__global half *)tgt);
}
"#;
/// Computes x^scalar for every f16 value in the tensor
const POW_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void pow_f16(__global half *tgt,
                      const int ncols_capacity,
                      const float scalar)
{
    const int tgt_row = get_global_id(0);
    const int tgt_col = get_global_id(1);
    const float val = vload_half(tgt_row * ncols_capacity + tgt_col, (__global const half *)tgt);
    const float result = pow(val, scalar);
    vstore_half(result, tgt_row * ncols_capacity + tgt_col, (__global half *)tgt);
}
"#;
/// Computes the mean over the columns of each row in a tensor, producing a single-column result
const MEAN_COLS_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void mean_cols_f16(__global half *tgt,
                            __global const half *left,
                            const int ncols_capacity,
                            const int left_cols_capacity,
                            const int ncolumns)
{
    // global work group size is nrows x 1
    const int row = get_global_id(0);
    float16 src_value = 0.0;
    for (int col16 = 0; col16 < left_cols_capacity; col16 += 16) {
        const int actual_col = col16;
        if (actual_col >= ncolumns) {
            break;
        }
        src_value += vload_half16((row * left_cols_capacity) / 16 + col16 / 16, (__global const half *)left);
    }
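    // Note: the final 16-wide chunk can reach past ncolumns; assuming the row padding is zeroed
    // (as the matrix multiplication comment above states), those extra lanes add nothing to the
    // sum, so dividing by ncolumns below still yields the correct mean.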
    float src_value_sum = src_value.s0 + src_value.s1 + src_value.s2 + src_value.s3 + src_value.s4 + src_value.s5 + src_value.s6 + src_value.s7 + src_value.s8 + src_value.s9 + src_value.sa + src_value.sb + src_value.sc + src_value.sd + src_value.se + src_value.sf;
    src_value_sum = src_value_sum / (float) ncolumns;
    vstore_half(src_value_sum, row * ncols_capacity, (__global half *)tgt);
}
"#;
/// Adds a scalar to a tensor
const ADD_SCALAR_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void add_scalar_f16(__global half *tgt, const int ncols_capacity, const float scalar)
{
    const int tgt_row = get_global_id(0);
    const int tgt_col = get_global_id(1);
    const float val = vload_half(tgt_row * ncols_capacity + tgt_col, (__global const half *)tgt);
    const float result = val + scalar;
    vstore_half(result, tgt_row * ncols_capacity + tgt_col, (__global half *)tgt);
}
"#;
/// Multiplies each row of a tensor by the scalar taken from the corresponding row of a column vector
const SCALAR_MULTIPLY_BROADCAST_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void scalar_multiply_broadcast_f16(__global half *tgt,
                                            __global const half *left,
                                            const int ncols_capacity,
                                            const int left_cols_capacity)
{
    // global work group size is nrows x (ncols/16)
    const int row = get_global_id(0);
    const int col = get_global_id(1) * 16;
    const float scalar = vload_half(row * left_cols_capacity, (__global const half *)left);
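    // 'left' is read as a column vector: one scalar per row, which multiplies every element of
    // this row of 'tgt'.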
    float16 src_value = vload_half16((row * ncols_capacity) / 16 + col / 16, (__global const half *)tgt) * scalar;
    vstore_half16(src_value, (row * ncols_capacity) / 16 + col / 16, (__global half *)tgt);
}
"#;
/// Computes a Hadamard product that broadcasts a row vector across every row of a tensor
const HADAMARD_PRODUCT_BROADCAST_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void hadamard_product_broadcast_f16(__global half *tgt,
                                             __global const half *left,
                                             const int ncols_capacity,
                                             const int left_cols_capacity)
{
    // global work group size is nrows x (ncols/16)
    const int row = get_global_id(0);
    const int col16 = get_global_id(1) * 16;
    const float16 product_value = vload_half16(col16 / 16, (__global const half *)left);
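    // 'left' is read as a row vector (indexed only by column), so the same 16 values multiply
    // the matching columns of every row of 'tgt'.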
    const float16 src_value = vload_half16((row * ncols_capacity) / 16 + col16 / 16, (__global const half *)tgt);
    const float16 result = src_value * product_value;
    vstore_half16(result, (row * ncols_capacity) / 16 + col16 / 16, (__global half *)tgt);
}
"#;
/// Computes 1/sqrt(x) for each f16 value in the tensor
const RSQRT_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void rsqrt_f16(__global half *tgt, const int ncols_capacity)
{
    const int tgt_row = get_global_id(0);
    const int tgt_col = get_global_id(1);
    const float val = vload_half(tgt_row * ncols_capacity + tgt_col, (__global const half *)tgt);
    const float result = rsqrt(val);
    vstore_half(result, tgt_row * ncols_capacity + tgt_col, (__global half *)tgt);
}
"#;
/// Computes the element-wise sum of two tensors
const ADD_F16_SRC: &str = r#"
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void add_f16(__global half *tgt,
                      __global const half *left,
                      const int tgt_ncols_capacity,
                      const int left_ncols_capacity)
{
    const int tgt_row = get_global_id(0);
    const int tgt_col = get_global_id(1);
    const float tgt_v = vload_half(tgt_row * tgt_ncols_capacity + tgt_col, (__global const half *)tgt);
    const float left_v = vload_half(tgt_row * left_ncols_capacity + tgt_col, (__global const half *)left);
    const float result = tgt_v + left_v;
    vstore_half(result, tgt_row * tgt_ncols_capacity + tgt_col, (__global half *)tgt);
}
"#;