It took an annoying amount of effort just to get this simple server working.
I tried the rouille web framework first, but it didn't support streaming
chunked output to the client line by line. (It seems that if it exposed
more details about the underlying tiny-http crate, I could have hacked it
to work.)
I went with Rocket because it had less async stuff and seemed decent.
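Rocket's response streams are what make the line-by-line output possible. A minimal sketch of the idea, not the server's actual handler; the route name and the fake token loop are placeholders:

```rust
#[macro_use]
extern crate rocket;

use rocket::response::stream::TextStream;
use rocket::tokio::time::{sleep, Duration};

// Placeholder route; the real server's endpoints are described in README.md.
#[get("/generate")]
fn generate() -> TextStream![String] {
    TextStream! {
        // Stand-in for the real inference loop: each yield becomes a new
        // piece of the chunked response, so the client can read the output
        // line by line as it is produced.
        for i in 0..5u32 {
            yield format!("token {}\n", i);
            sleep(Duration::from_millis(100)).await;
        }
    }
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![generate])
}
```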
I ran into weird issues where memory use seemed to keep increasing and
increasing. I may have fixed that, but I couldn't figure out what made it
use so much memory: even tools like valgrind and heaptrack told me there
wasn't that much memory allocated, yet I could see RES climbing in `htop`.
Switched to MiMalloc, as it seems to slightly decrease memory use.
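Swapping the global allocator in Rust is only a couple of lines. A sketch of the setup, assuming the `mimalloc` crate is listed in Cargo.toml:

```rust
use mimalloc::MiMalloc;

// Route all heap allocations in this binary through MiMalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Every allocation from here on uses MiMalloc.
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} elements", v.len());
}
```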
Added details about the inference server to README.md, along with an
example Python script that uses it.
I want to use this feature later to investigate how much quantization or
the choice of f16/f32 affects output. Such things are easier to do in
Python.
This should make it a bit easier to port to other SIMD instruction sets,
now that the SIMD instructions are no longer littered randomly around the
tensor.rs file.
In benchmarks it is modestly faster than f32. The main transformer loop
doesn't know how to use f16 yet, though, and I need to implement some
other ops before that starts working.
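For reference, here is the general shape such an op can take (a sketch assuming the `half` crate for f16 storage, which is an assumption rather than necessarily what this repo does): keep the data in f16 but accumulate in f32 to limit rounding error.

```rust
use half::f16;

// Convert an f32 buffer to f16 for storage; each value is rounded to the
// nearest representable f16.
fn to_f16(src: &[f32]) -> Vec<f16> {
    src.iter().copied().map(f16::from_f32).collect()
}

// Dot product over f16 data, accumulating in f32 for precision.
fn dot_f16(a: &[f16], b: &[f16]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x.to_f32() * y.to_f32()).sum()
}

fn main() {
    let a = to_f16(&[1.0, 2.0, 3.0]);
    let b = to_f16(&[4.0, 5.0, 6.0]);
    println!("dot = {}", dot_f16(&a, &b)); // 32
}
```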
This is something like 10 times faster than the old one, but surprisingly
it didn't have much impact on text generation time. Maybe most of the
remaining slowness no longer comes from matrix multiplication.
This also slowed down the CPU implementation. I think I'll try adding
another kernel later for OpenCL on the CPU.
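One quick way to check where the time actually goes (a generic sketch, not code from this repo) is to accumulate wall-clock time per stage of a token step and compare the totals:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Accumulate wall-clock time per labelled stage across many token steps.
struct StageTimer {
    totals: HashMap<&'static str, Duration>,
}

impl StageTimer {
    fn new() -> Self {
        StageTimer { totals: HashMap::new() }
    }

    // Run `f`, charging its elapsed time to `label`.
    fn time<T>(&mut self, label: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        *self.totals.entry(label).or_default() += start.elapsed();
        out
    }

    fn report(&self) {
        for (label, total) in &self.totals {
            println!("{label}: {total:?}");
        }
    }
}

fn main() {
    let mut timer = StageTimer::new();
    for _ in 0..100 {
        // Stand-ins for the real stages of a token step.
        let _ = timer.time("matmul", || (0..10_000u64).sum::<u64>());
        let _ = timer.time("everything_else", || (0..10_000u64).product::<u64>());
    }
    timer.report();
}
```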
The tokenizer would misinterpret newlines. In general, non-printable
control characters don't seem to be tokenized correctly at the moment. I
added a band-aid for newlines but should probably fix the others too.
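For context, the LLaMA SentencePiece vocabulary carries a byte-fallback token for the newline byte (`<0x0A>`), so one band-aid of this general shape is possible (a sketch, not the repo's actual fix): split the input on newlines, tokenize each piece normally, and splice the newline token id back in between pieces. `tokenize` and `newline_token_id` below are stand-ins for the real tokenizer and vocabulary lookup.

```rust
// Band-aid sketch: handle '\n' outside the normal tokenizer pass.
fn tokenize_with_newline_fix(
    text: &str,
    tokenize: impl Fn(&str) -> Vec<u32>, // stand-in for the real tokenizer
    newline_token_id: u32,               // id of the vocabulary's <0x0A> token
) -> Vec<u32> {
    let mut ids = Vec::new();
    for (i, piece) in text.split('\n').enumerate() {
        if i > 0 {
            ids.push(newline_token_id);
        }
        if !piece.is_empty() {
            ids.extend(tokenize(piece));
        }
    }
    ids
}

fn main() {
    // Dummy tokenizer: one fake id per whitespace-separated word.
    let fake_tokenize = |s: &str| s.split_whitespace().map(|_| 1u32).collect::<Vec<_>>();
    // 13 stands in for the id of <0x0A> in the vocabulary.
    let ids = tokenize_with_newline_fix("hello world\nsecond line", fake_tokenize, 13);
    println!("{:?}", ids); // [1, 1, 13, 1, 1]
}
```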