GGUF Format Overview

Understanding GGUF quantization in llcuda.

What is GGUF?

GGUF (GPT-Generated Unified Format) is the native model file format of llama.cpp:

  • Binary, single-file model format
  • Efficient quantization
  • Fast loading
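
Every GGUF file begins with a fixed little-endian header: the 4-byte magic GGUF, a uint32 format version, and uint64 counts of tensors and metadata key-value pairs. A minimal sketch of reading that header in plain Python; the file name in the comment is hypothetical:

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header fields (all little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))            # uint32
        tensor_count, = struct.unpack("<Q", f.read(8))       # uint64
        metadata_kv_count, = struct.unpack("<Q", f.read(8))  # uint64
    return {"version": version,
            "tensor_count": tensor_count,
            "metadata_kv_count": metadata_kv_count}

# read_gguf_header("llama-3-8b.Q4_K_M.gguf")  # hypothetical file name
```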

Quantization Types

  • Q4_K_M: 4.8 bpw, recommended default for most models
  • Q5_K_M: 5.7 bpw, higher quality at a modest size increase
  • Q6_K: 6.6 bpw, near-FP16 quality
  • Q8_0: 8.5 bpw, very high quality, largest of the common quants

I-Quants (Compression)

  • IQ4_XS: 4.3 bpw, best quality of the I-quants listed here
  • IQ3_XS: 3.3 bpw, fits 70B models on a single 30GB GPU
  • IQ2_XS: 2.3 bpw, extreme compression with noticeable quality loss

Legacy

  • Q4_0: 4.5 bpw, superseded by Q4_K_M at a similar size
  • Q5_0: 5.5 bpw, superseded by Q5_K_M at a similar size
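
The bits-per-weight (bpw) figures above map directly to file size: roughly parameter count × bpw / 8 bytes, plus a small metadata overhead. A quick sketch using the values listed above (the 7B example is illustrative):

```python
BPW = {
    "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
    "IQ4_XS": 4.3, "IQ3_XS": 3.3, "IQ2_XS": 2.3,
    "Q4_0": 4.5, "Q5_0": 5.5,
}

def gguf_size_gb(n_params: float, quant: str) -> float:
    """Approximate weight file size in GB: params * bpw / 8 bits per byte.
    Ignores metadata overhead and runtime KV-cache memory."""
    return n_params * BPW[quant] / 8 / 1e9

print(f"{gguf_size_gb(7e9, 'Q4_K_M'):.1f} GB")  # 7B at Q4_K_M -> ~4.2 GB
```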

Selection Guide

VRAM    Model Size   Recommended Quant
 5GB    1-3B         Q4_K_M
10GB    7-8B         Q4_K_M
15GB    13B          Q4_K_M
30GB    70B          IQ3_XS
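
The table reduces to a simple fit check: quantized weight size plus runtime headroom must stay under available VRAM. A hypothetical helper illustrating that logic; the 1GB headroom is an assumed minimum, since the KV cache grows with context length:

```python
def fits(n_params: float, bpw: float, vram_gb: float,
         headroom_gb: float = 1.0) -> bool:
    """True if the quantized weights plus an assumed runtime headroom
    (KV cache, activations, CUDA buffers) fit in VRAM."""
    weights_gb = n_params * bpw / 8 / 1e9  # bits -> bytes -> GB
    return weights_gb + headroom_gb <= vram_gb

# 70B model on a 30GB GPU, checking quants from the lists above:
for name, bpw in [("Q4_K_M", 4.8), ("IQ4_XS", 4.3),
                  ("IQ3_XS", 3.3), ("IQ2_XS", 2.3)]:
    print(name, fits(70e9, bpw, vram_gb=30))
# Only IQ3_XS and IQ2_XS fit; IQ3_XS is the higher-quality choice,
# matching the table's 70B row.
```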

See also:

  • K-Quants Guide
  • I-Quants Guide
  • Selection Guide