From 8c59958afce6cdea87fbbedb0fbecffea4dd6a5d Mon Sep 17 00:00:00 2001
From: randaller
Date: Sun, 19 Mar 2023 16:03:00 +0300
Subject: [PATCH] Update README.md

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index b159b61..9cc23b9 100644
--- a/README.md
+++ b/README.md
@@ -189,6 +189,20 @@ import torch
 torch.set_default_dtype(torch.bfloat16)
 ```
 
+### Offload to GPU with accelerate
+
+```
+device_map = infer_auto_device_map(model, max_memory={0: "6GiB", "cpu": "128GiB"})
+```
+
+Someone with an A100 might set max_memory for the GPU to 38GiB and run inference entirely in GPU VRAM.
+
+For me, with 6GiB on a 3070 Ti, this runs about three times slower than pure CPU inference.
+
+```
+python hf-inference-cuda-example.py
+```
+
 ## Reference
 
 LLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971
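As a companion to the one-line snippet in the patch, here is a minimal sketch of what a full offload script could look like, assuming a Hugging Face-format LLaMA checkpoint and the public `accelerate`/`transformers` APIs (`init_empty_weights`, `infer_auto_device_map`, `load_checkpoint_and_dispatch`). The checkpoint path, memory limits, and dtype are placeholders; this is not the repository's actual `hf-inference-cuda-example.py`.

```
# Hypothetical offload sketch -- not the repo's hf-inference-cuda-example.py.
# "path/to/llama-7b-hf" is a placeholder for a Hugging Face-format checkpoint.
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/llama-7b-hf"

# Build the model skeleton without allocating weight memory.
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap GPU 0 at 6GiB (e.g. a 3070 Ti) and spill the remaining layers to CPU RAM.
device_map = infer_auto_device_map(model, max_memory={0: "6GiB", "cpu": "128GiB"})

# Load the real weights and place each layer according to device_map.
model = load_checkpoint_and_dispatch(model, checkpoint, device_map=device_map, dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("The capital of France is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The slowdown the patch reports is expected: layers that do not fit in the GPU budget run on (or are streamed from) the CPU at every forward pass, so a small VRAM cap mostly adds transfer overhead rather than speed.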