d7d13cd474 Bucketize the 4-bit quantization for more accuracy.
k4bit
Mikko Juola
2023-03-23 02:11:23 -0700
8cc82ae7e2 Make separate matrix_vector_muls for 4-bit quantization rather than using matrix_mul for them.
Mikko Juola
2023-03-22 22:56:08 -0700
2f3e9bc0f5 K4 bit inference works now. Performance isn't as good as I'd like it to be though.
Mikko Juola
2023-03-22 21:57:18 -0700
40121e1c82 Multithread the k4 * f32 matrix multiplication.
Mikko Juola
2023-03-22 21:13:13 -0700
b8946da2d8 Implement matrix multiplication for 4-bit * 32-bit floats.
Mikko Juola
2023-03-22 21:05:14 -0700
f6249e8d9f Add skeleton code for 4-bit quantization.
master
Mikko Juola
2023-03-21 01:14:48 -0700
26f343ad15 Add a flag that will exit the HTTP server after just one query.
Mikko Juola
2023-03-20 19:14:35 -0700
957a8f9f98 Mention that `server` feature must be turned on to use the inference API.
Mikko Juola
2023-03-20 19:02:10 -0700
5e241722cb Fix compilation when opencl feature is being used.
Mikko Juola
2023-03-20 18:30:41 -0700
d85ed7f23e Mention HTTP server in features in README.md
Mikko Juola
2023-03-20 18:29:44 -0700
a8320613a1 Fix some things in README.md after proofreading it and removing lies.
Mikko Juola
2023-03-20 18:28:27 -0700
b9be485610 Add simple HTTP API support.
Mikko Juola
2023-03-20 18:24:02 -0700
9c86c17318 Refactor all SIMD to one file, simd_support.rs
Mikko Juola
2023-03-18 15:25:12 -0700
25e3e12d9d Update README.md on LLaMA-65B benchmark result.
Mikko Juola
2023-03-18 09:52:09 -0700
f233f8ad8f Forgot to mark last benchmark at March 17
Mikko Juola
2023-03-18 01:34:40 -0700
db0f22ed26 Update README.md, add a nice animation.
Mikko Juola
2023-03-18 01:30:32 -0700
cfad4b1205 Bump version to 0.3.0
Mikko Juola
2023-03-18 00:55:35 -0700
016b609481 More install instructions.
Mikko Juola
2023-03-18 00:55:03 -0700
2666571e2b Update README.md to show `rllama` is on crates.io now.
Mikko Juola
2023-03-18 00:51:36 -0700
58b61cba39 Bump version to 0.2.0
Mikko Juola
2023-03-18 00:50:45 -0700
ebdea727fd Don't let the crate be built without avx2, avx, etc. or it'll be very slow.
Mikko Juola
2023-03-18 00:48:44 -0700
f2c38a272f Update Cargo dependencies.
Mikko Juola
2023-03-18 00:15:22 -0700
91dee4f114 Add --quiet flag, make colors respect --quiet so you just get the output and nothing else.
Mikko Juola
2023-03-17 23:58:04 -0700
109171b50e Mention that this is AMD64 only because of AVX2.
Mikko Juola
2023-03-17 23:50:23 -0700
ff349eeea0 Make number of threads configurable and obtained by default from the system rather than hardcoding to 32.
Mikko Juola
2023-03-17 23:48:43 -0700
44e0abf0f1 Clarify that the OpenCL implementations all use f16.
Mikko Juola
2023-03-17 23:43:04 -0700
58463458ee Put benchmarks on top of README.md.
Mikko Juola
2023-03-17 23:40:38 -0700
882ff05254 Update README.md for new benchmarks.
Mikko Juola
2023-03-17 23:33:04 -0700
3d0afcf243 Make matrix multiplication multithreaded.
Mikko Juola
2023-03-17 23:25:19 -0700
8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way but right now it looks like most memory friendly.
Mikko Juola
2023-03-17 22:42:33 -0700
1f5e687298 Modest improvement to f16 matrix_vector_mul_transposed without OpenCL.
Mikko Juola
2023-03-17 14:03:56 -0700
acfd6bd5bd Add f16, non-OpenCL version of matrix_vector_mul_transposed as well.
Mikko Juola
2023-03-17 13:26:58 -0700
baecd25ee3 Add f16 version of matrix multiplication that works without any OpenCL.
Mikko Juola
2023-03-17 13:07:15 -0700
a1970b8a9c Improve matrix multiplication transposed further, this gives around ~10%-20% further increase by improving memory load to instruction ratio.
Mikko Juola
2023-03-17 11:04:35 -0700
61bc42b728 Improve the handwritten AVX2 for matrix_mul_inplace_transposed.
Mikko Juola
2023-03-16 08:53:31 -0700
0cce655763 Unroll the handwritten AVX2 matrix_vector_mul_transposed slightly, gives ~20% boost to that operation.
Mikko Juola
2023-03-16 08:34:08 -0700
09f76dfcfa Update README.md opening with new benchmark numbers.
Mikko Juola
2023-03-15 12:47:32 -0700
4b8accee44 Update benchmarks.
Mikko Juola
2023-03-15 12:17:36 -0700
de5dd59277 Some code cleanup in OpenCL.
Mikko Juola
2023-03-15 12:07:57 -0700
8aef5d8831 Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make _inplace use consistent.
Mikko Juola
2023-03-15 11:45:15 -0700
1c5ec04217 Add a different kernel to be used when OpenCL device is a CPU.
Mikko Juola
2023-03-15 01:50:00 -0700
8c64313fec Rewrite the matrix multiplication.
Mikko Juola
2023-03-15 01:24:32 -0700
862d4a15d6 Add repetition penalty, add colors to outputs based on probabilities, try to make softmax() more numerically stable.
Mikko Juola
2023-03-14 00:40:08 -0700
f4629ca987 Respect the stop token from the model.
Mikko Juola
2023-03-13 22:38:20 -0700
de477314ed Fix newlines not recognized when feeding newlines in the prompt.
Mikko Juola
2023-03-13 22:33:46 -0700
687bbf1249 Add instructions on how to use OpenCL in the README.md
Mikko Juola
2023-03-13 21:59:45 -0700
8de18bdc77 Add screenshot to README.md.
Mikko Juola
2023-03-13 21:47:20 -0700
a2e88c1193 Update README.md
Mikko Juola
2023-03-13 21:42:03 -0700
17b9d90570 Add lots of code I added for OpenCL, but text generation got broken and I have no idea why.
broken-opencl-code
Mikko Juola
2023-03-13 21:27:04 -0700
b4d5cf91a7 Mention in README.md that using OpenCL does not cast weights to 32-bit floats.
Mikko Juola
2023-03-13 17:45:23 -0700
99da6ed71a Update README.md benchmarks for new attention OpenCL thing.
Mikko Juola
2023-03-13 17:44:03 -0700
35b0c372a8 Implement some attention operations for OpenCL.
Mikko Juola
2023-03-13 17:38:12 -0700
6e456e64f3 Add new benchmarks now that this is partially OpenCLified.
Mikko Juola
2023-03-13 17:17:09 -0700
63d27dba90 Add partial OpenCL support, it's used in feed forward network only.
Mikko Juola
2023-03-13 17:11:00 -0700
df079bceb0 Add records of my benchmarks to README.md so I can compare it later.
Mikko Juola
2023-03-13 13:05:32 -0700
c9c861d199 Add some measurements so we can get tokens per second.
Mikko Juola
2023-03-13 12:59:07 -0700
22792b26cc Add an idea about on-disk cache for initial prompt processing (not for weights).
Mikko Juola
2023-03-13 12:45:16 -0700
9087c50efa Add notes about improving sampler to README.md
Mikko Juola
2023-03-13 12:41:57 -0700
1a88482988 Add some OpenCL bits.
Mikko Juola
2023-03-13 12:33:21 -0700
a92017bf56 Add some initial OpenCL stuff.
Mikko Juola
2023-03-12 01:20:17 -0800
53d367e6fa Add some beginnings of OpenCL implementation.
Mikko Juola
2023-03-12 00:35:54 -0800
846759b277 Optimize conversions to and from f16<->32.
Mikko Juola
2023-03-11 23:21:00 -0800
8acb9f32b8 Update README.md for new discoveries.
Mikko Juola
2023-03-11 22:55:08 -0800
26d5309cf7 Add support for bigger models.
Mikko Juola
2023-03-11 21:50:59 -0800
8a427bcb21 The project is actually called rllama, put that in readme.md.
Mikko Juola
2023-03-11 12:01:55 -0800
18ef805458 Read parameters from model's JSON file instead of hard-coding them, make max sequence length configurable.
Mikko Juola
2023-03-11 10:44:06 -0800
f103871bc0 Make the output colored. This is essential to be taken seriously.
Mikko Juola
2023-03-11 10:21:08 -0800
cd28aba5e2 Make the output look nicer.
Mikko Juola
2023-03-11 03:03:50 -0800
d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file.
Mikko Juola
2023-03-11 02:46:21 -0800
8bb9404168 Update README to clarify this is a Rust project and to show how to change temperature, top_k, top_p stuff.
Mikko Juola
2023-03-11 00:47:32 -0800
f6217e0036 Add readme, make clippy happy.
Mikko Juola
2023-03-11 00:40:28 -0800
3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay.
Mikko Juola
2023-03-11 00:31:40 -0800