Update README.md

Running the model with a single prompt on a Windows computer equipped with a 12700K, a fast NVMe drive and 128 GB of RAM.
| model | RAM usage, fp32 | RAM usage, bf16 | fp32 inference | bf16 inference | fp32 model load |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 7B | 44 GB, peak 56 GB | 22 GB | 170 seconds | 850 seconds | 23 seconds |
| 13B | 77 GB, peak 100 GB | 38 GB | 340 seconds | | 61 seconds |
| 30B | 180 GB, peak 258 GB | | | | 372 seconds |
### RAM usage optimization
By default, torch uses float32 precision when running on the CPU, which leads, for example, to 44 GB of RAM usage for the 7B model. We may use bfloat16 precision on the CPU as well, which halves RAM consumption, down to 22 GB for the 7B model, but inference runs much slower.
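
As a rough illustration (this is a minimal sketch, not the repo's actual loading code), the same halving applies to any torch module: casting parameters to bfloat16 stores 2 bytes per weight instead of 4.

```python
import torch
import torch.nn as nn

# Sketch: a stand-in module instead of the real LLaMA checkpoint.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

model = model.to(torch.bfloat16)  # cast all weights to bf16, halving their size
bf16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(f"fp32: {fp32_bytes / 2**20:.1f} MiB, bf16: {bf16_bytes / 2**20:.1f} MiB")

# Inputs must match the model's dtype; bf16 matmuls on CPU are typically slower.
x = torch.randn(1, 4096, dtype=torch.bfloat16)
with torch.inference_mode():
    y = model(x)
print(y.dtype)  # torch.bfloat16
```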
