rllama

d7d13cd474 Bucketize the 4-bit quantization for more accuracy. [k4bit] Mikko Juola 2023-03-23 02:11:23 -0700
8cc82ae7e2 Make separate matrix_vector_muls for 4-bit quantization rather than using matrix_mul for them. Mikko Juola 2023-03-22 22:56:08 -0700
2f3e9bc0f5 K4 bit inference works now. Performance isn't as good as I'd like it to be though. Mikko Juola 2023-03-22 21:57:18 -0700
40121e1c82 Multithread the k4 * f32 matrix multiplication. Mikko Juola 2023-03-22 21:13:13 -0700
b8946da2d8 Implement matrix multiplication for 4-bit * 32-bit floats. Mikko Juola 2023-03-22 21:05:14 -0700
f6249e8d9f Add skeleton code for 4-bit quantization. [master] Mikko Juola 2023-03-21 01:14:48 -0700
26f343ad15 Add a flag that will exit the HTTP server after just one query. Mikko Juola 2023-03-20 19:14:35 -0700
957a8f9f98 Mention that `server` feature must be turned on to use the inference API. Mikko Juola 2023-03-20 19:02:10 -0700
5e241722cb Fix compilation when opencl feature is being used. Mikko Juola 2023-03-20 18:30:41 -0700
d85ed7f23e Mention HTTP server in features in README.md Mikko Juola 2023-03-20 18:29:44 -0700
a8320613a1 Fix some things in README.md after proofreading it and removing lies. Mikko Juola 2023-03-20 18:28:27 -0700
b9be485610 Add simple HTTP API support. Mikko Juola 2023-03-20 18:24:02 -0700
9c86c17318 Refactor all SIMD to one file, simd_support.rs Mikko Juola 2023-03-18 15:25:12 -0700
25e3e12d9d Update README.md on LLaMA-65B benchmark result. Mikko Juola 2023-03-18 09:52:09 -0700
f233f8ad8f Forgot to mark last benchmark at March 17 Mikko Juola 2023-03-18 01:34:40 -0700
db0f22ed26 Update README.md, add a nice animation. Mikko Juola 2023-03-18 01:30:32 -0700
cfad4b1205 Bump version to 0.3.0 Mikko Juola 2023-03-18 00:55:35 -0700
016b609481 More install instructions. Mikko Juola 2023-03-18 00:55:03 -0700
2666571e2b Update README.md to show `rllama` is on crates.io now. Mikko Juola 2023-03-18 00:51:36 -0700
58b61cba39 Bump version to 0.2.0 Mikko Juola 2023-03-18 00:50:45 -0700
ebdea727fd Don't let the crate be built without avx2, avx, etc. or it'll be very slow. Mikko Juola 2023-03-18 00:48:44 -0700
f2c38a272f Update Cargo dependencies. Mikko Juola 2023-03-18 00:15:22 -0700
91dee4f114 Add --quiet flag, make colors respect --quiet so you just get the output and nothing else. Mikko Juola 2023-03-17 23:58:04 -0700
109171b50e Mention that this is AMD64 only because of AVX2. Mikko Juola 2023-03-17 23:50:23 -0700
ff349eeea0 Make number of threads configurable and obtained by default from the system rather than hardcoding to 32. Mikko Juola 2023-03-17 23:48:43 -0700
44e0abf0f1 Clarify that the OpenCL implementations all use f16. Mikko Juola 2023-03-17 23:43:04 -0700
58463458ee Put benchmarks on top of README.md. Mikko Juola 2023-03-17 23:40:38 -0700
882ff05254 Update README.md for new benchmarks. Mikko Juola 2023-03-17 23:33:04 -0700
3d0afcf243 Make matrix multiplication multithreaded. Mikko Juola 2023-03-17 23:25:19 -0700
8134c20d57 We can now run in (mostly) f16 mode without any OpenCL. It's not the fastest way but right now it looks like most memory friendly. Mikko Juola 2023-03-17 22:42:33 -0700
1f5e687298 Modest improvement to f16 matrix_vector_mul_transposed without OpenCL. Mikko Juola 2023-03-17 14:03:56 -0700
acfd6bd5bd Add f16, non-OpenCL version of matrix_vector_mul_transposed as well. Mikko Juola 2023-03-17 13:26:58 -0700
baecd25ee3 Add f16 version of matrix multiplication that works without any OpenCL. Mikko Juola 2023-03-17 13:07:15 -0700
a1970b8a9c Improve matrix multiplication transposed further, this gives around ~10%-20% further increase by improving memory load to instruction ratio. Mikko Juola 2023-03-17 11:04:35 -0700
61bc42b728 Improve the handwritten AVX2 for matrix_mul_inplace_transposed. Mikko Juola 2023-03-16 08:53:31 -0700
0cce655763 Unroll the handwritten AVX2 matrix_vector_mul_transposed slightly, gives ~20% boost to that operation. Mikko Juola 2023-03-16 08:34:08 -0700
09f76dfcfa Update README.md opening with new benchmark numbers. Mikko Juola 2023-03-15 12:47:32 -0700
4b8accee44 Update benchmarks. Mikko Juola 2023-03-15 12:17:36 -0700
de5dd59277 Some code cleanup in OpenCL. Mikko Juola 2023-03-15 12:07:57 -0700
8aef5d8831 Rename to_gpu and to_cpu to to_gpu_inplace and to_cpu_inplace to make _inplace use consistent. Mikko Juola 2023-03-15 11:45:15 -0700
1c5ec04217 Add a different kernel to be used when OpenCL device is a CPU. Mikko Juola 2023-03-15 01:50:00 -0700
8c64313fec Rewrite the matrix multiplication. Mikko Juola 2023-03-15 01:24:32 -0700
862d4a15d6 Add repetition penalty, add colors to outputs based on probabilities, try to make softmax() more numerically stable. Mikko Juola 2023-03-14 00:40:08 -0700
f4629ca987 Respect the stop token from the model. Mikko Juola 2023-03-13 22:38:20 -0700
de477314ed Fix newlines not recognized when feeding newlines in the prompt. Mikko Juola 2023-03-13 22:33:46 -0700
687bbf1249 Add instructions on how to use OpenCL in the README.md Mikko Juola 2023-03-13 21:59:45 -0700
8de18bdc77 Add screenshot to README.md. Mikko Juola 2023-03-13 21:47:20 -0700
a2e88c1193 Update README.md Mikko Juola 2023-03-13 21:42:03 -0700
17b9d90570 Add lots of code I added for OpenCL, but text generation got broken and I have no idea why. [broken-opencl-code] Mikko Juola 2023-03-13 21:27:04 -0800
b4d5cf91a7 Mention in README.md that using OpenCL does not cast weights to 32-bit floats. Mikko Juola 2023-03-13 17:45:23 -0700
99da6ed71a Update README.md benchmarks for new attention OpenCL thing. Mikko Juola 2023-03-13 17:44:03 -0700
35b0c372a8 Implement some attention operations for OpenCL. Mikko Juola 2023-03-13 17:38:12 -0700
6e456e64f3 Add new benchmarks now that this is partially OpenCLified. Mikko Juola 2023-03-13 17:17:09 -0700
63d27dba90 Add partial OpenCL support, it's used in feed forward network only. Mikko Juola 2023-03-13 17:11:00 -0700
df079bceb0 Add records of my benchmarks to README.md so I can compare it later. Mikko Juola 2023-03-13 13:05:32 -0700
c9c861d199 Add some measurements so we can get tokens per second. Mikko Juola 2023-03-13 12:59:07 -0700
22792b26cc Add an idea about on-disk cache for initial prompt processing (not for weights). Mikko Juola 2023-03-13 12:45:16 -0700
9087c50efa Add notes about improving sampler to README.md Mikko Juola 2023-03-13 12:41:57 -0700
1a88482988 Add some OpenCL bits. Mikko Juola 2023-03-13 12:33:21 -0700
a92017bf56 Add some initial OpenCL stuff. Mikko Juola 2023-03-12 01:20:17 -0800
53d367e6fa Add some beginnings of OpenCL implementation. Mikko Juola 2023-03-12 00:35:54 -0800
846759b277 Optimize conversions to and from f16<->32. Mikko Juola 2023-03-11 23:21:00 -0800
8acb9f32b8 Update README.md for new discoveries. Mikko Juola 2023-03-11 22:55:08 -0800
26d5309cf7 Add support for bigger models. Mikko Juola 2023-03-11 21:50:59 -0800
8a427bcb21 The project is actually called rllama, put that in readme.md. Mikko Juola 2023-03-11 12:01:55 -0800
18ef805458 Read parameters from model's JSON file instead of hard-coding them, make max sequence length configurable. Mikko Juola 2023-03-11 10:44:06 -0800
f103871bc0 Make the output colored. This is essential to be taken seriously. Mikko Juola 2023-03-11 10:21:08 -0800
cd28aba5e2 Make the output look nicer. Mikko Juola 2023-03-11 03:03:50 -0800
d7a3f57510 Update README.md, add multithreading and optimizations to some operations, allow loading prompt from a file. Mikko Juola 2023-03-11 02:46:21 -0800
8bb9404168 Update README to clarify this is a Rust project and to show how to change temperature, top_k, top_p stuff. Mikko Juola 2023-03-11 00:47:32 -0800
f6217e0036 Add readme, make clippy happy. Mikko Juola 2023-03-11 00:40:28 -0800
3b8f904f13 First commit. LLaMA works now. It is not pretty but it does generate text from prompts. Yay. Mikko Juola 2023-03-11 00:31:40 -0800
