# I-Quants (Importance Quantization)
Ultra-compressed quantization for running 70B models in roughly 30 GB of VRAM.
## Overview
I-quants use importance-based quantization: an importance matrix (imatrix), computed from calibration data, estimates which weights most affect model output, so those weights are quantized more precisely while the rest absorb most of the compression. This allows extreme compression (roughly 2-3 bits per weight) with acceptable quality loss.
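The principle can be sketched in a few lines: measure per-channel importance from activation statistics, then spend more bits on the important channels. This is a hypothetical toy illustration of the idea, not llama.cpp's actual IQ kernels, and the importance proxy (mean squared activation) is an assumption:

```python
import numpy as np

def toy_importance_quantize(weights, activations, high_bits=4, low_bits=2):
    """Toy sketch: quantize important weight columns more finely.

    Uses mean squared activation magnitude as an importance proxy
    (loosely analogous to llama.cpp's imatrix, but much simplified).
    """
    importance = (activations ** 2).mean(axis=0)  # per-input-channel importance
    threshold = np.median(importance)

    quantized = np.empty_like(weights)
    for col in range(weights.shape[1]):
        bits = high_bits if importance[col] >= threshold else low_bits
        levels = 2 ** bits
        w = weights[:, col]
        scale = (w.max() - w.min()) / (levels - 1) or 1.0
        q = np.round((w - w.min()) / scale)        # snap to integer levels
        quantized[:, col] = q * scale + w.min()    # dequantize back to floats
    return quantized
```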
## I-Quant Types
| Type | Bits per weight | VRAM (70B) | Quality | Use case |
|---|---|---|---|---|
| IQ3_XS | ~3 | ~28 GB | Good | 70B on dual T4 |
| IQ2_XXS | ~2 | ~21 GB | Fair | Ultra-compressed |
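The VRAM figures follow directly from bits per weight. A rough estimate of weight storage alone (the effective bit widths of ~3.3 and ~2.4 are assumptions; KV cache and runtime overhead come on top):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, ignoring KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9

print(gguf_size_gb(70e9, 3.3))  # ~28.9 GB, matching the IQ3_XS row
print(gguf_size_gb(70e9, 2.4))  # ~21.0 GB, matching the IQ2_XXS row
```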
## When to Use I-Quants
- Running 70B models in ~30 GB of VRAM (a quick capacity check is sketched below)
- Dual-T4 Kaggle setups
- Workloads where fitting the model matters more than output quality
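A minimal sketch for checking whether your setup has enough total VRAM, assuming PyTorch is available; the 30 GB cutoff is a rule of thumb taken from the table above:

```python
import torch

def total_vram_gb() -> float:
    """Sum VRAM across all visible GPUs (e.g., 2 x 16 GB on Kaggle's dual T4)."""
    return sum(
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ) / 1e9

vram = total_vram_gb()
quant = "IQ3_XS" if vram >= 30 else "IQ2_XXS"
print(f"{vram:.1f} GB available -> try {quant}")
```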
## Example
```python
from huggingface_hub import hf_hub_download

# Download the IQ3_XS quant of Llama 3.1 70B Instruct (~28 GB file)
model_path = hf_hub_download(
    repo_id="unsloth/Llama-3.1-70B-Instruct-GGUF",
    filename="Llama-3.1-70B-Instruct-IQ3_XS.gguf",
)
```
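Once downloaded, the GGUF file can be loaded with, for example, llama-cpp-python. This is a sketch; the even `tensor_split` across two T4s and the context size are assumptions you may need to tune:

```python
from llama_cpp import Llama

llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # split weights evenly across two T4s (assumption)
    n_ctx=4096,               # context window; lower it if VRAM runs short
)
print(llm("Q: What is an i-quant? A:", max_tokens=64)["choices"][0]["text"])
```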
## Performance
- Llama 3.1 70B IQ3_XS: ~12 tokens/sec on dual T4
- Total VRAM usage: ~28-29 GB across both GPUs
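Throughput numbers like these can be reproduced with a simple timing loop, assuming the `llm` object from the example above:

```python
import time

prompt = "Write a short paragraph about quantization."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens / elapsed:.1f} tokens/sec")
```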