
GGUF Format Guide

Complete guide to the GGUF (GPT-Generated Unified Format) model format used by llcuda v2.1.0.

What is GGUF?

GGUF (GPT-Generated Unified Format) is a binary file format, developed by the llama.cpp project, for storing large language models.

Key Features

✅ Single-file distribution - Everything in one portable file
✅ Efficient storage - Compact binary format; quantization keeps files small
✅ Memory mapping - Fast loading without full RAM allocation
✅ Quantization support - Multiple precision levels (INT4, INT8, FP16)
✅ Metadata included - Model architecture, tokenizer, and configuration
✅ Cross-platform - Works on Linux, macOS, Windows
✅ GPU acceleration - Full CUDA support for inference

Why GGUF?

GGUF replaced the older GGML format and offers significant improvements:

| Feature       | GGML (Old)       | GGUF (Current)     |
|---------------|------------------|--------------------|
| Metadata      | External files   | Embedded           |
| Versioning    | Limited          | Full versioning    |
| Tokenizer     | Separate file    | Included           |
| Architecture  | Hard-coded       | Dynamic            |
| Compatibility | Breaking changes | Forward compatible |

GGUF File Structure

A GGUF file contains:

GGUF File (.gguf)
├── Header (magic number, version)
├── Metadata (KV pairs)
│   ├── Architecture (llama, gemma, qwen, etc.)
│   ├── Model parameters (layers, heads, etc.)
│   ├── Tokenizer (vocabulary, special tokens)
│   ├── Quantization method
│   └── Author, license, source
├── Tensor info (names, shapes, offsets)
└── Tensor data (model weights)
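
As a quick illustration of that layout, the fixed-size header at the start of the file can be read with the standard library alone. This is a minimal sketch assuming a little-endian file and the current GGUF header layout; the filename is a placeholder.

import struct

# Minimal sketch: read the fixed-size GGUF header fields (little-endian).
with open("model.gguf", "rb") as f:
    magic = f.read(4)                                    # b"GGUF" for a valid file
    version, = struct.unpack("<I", f.read(4))            # format version (currently 3)
    tensor_count, = struct.unpack("<Q", f.read(8))       # number of tensors
    metadata_kv_count, = struct.unpack("<Q", f.read(8))  # number of metadata key/value pairs

print(f"magic={magic}, version={version}, "
      f"tensors={tensor_count}, metadata keys={metadata_kv_count}")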

Example GGUF Metadata

from llcuda.gguf_parser import GGUFReader

# Read GGUF metadata
reader = GGUFReader("gemma-3-1b-it-Q4_K_M.gguf")

print(f"Architecture: {reader.architecture}")
print(f"Parameter count: {reader.parameter_count}")
print(f"Quantization: {reader.quantization}")
print(f"Context length: {reader.context_length}")

Quantization Types

GGUF supports multiple quantization methods that trade off quality for size and speed.

Quantization Comparison

| Type   | Bits | Size Multiplier  | Quality    | Speed       | Use Case            |
|--------|------|------------------|------------|-------------|---------------------|
| F16    | 16   | 1.0x (largest)   | Best       | Slowest     | Reference quality   |
| Q8_0   | 8    | 0.5x             | Excellent  | Slow        | High quality needed |
| Q6_K   | 6    | 0.4x             | Very good  | Medium      | Balanced            |
| Q5_K_M | 5    | 0.35x            | Good       | Medium-fast | Good balance        |
| Q4_K_M | 4    | 0.25x            | Good       | Fast        | Recommended         |
| Q4_K_S | 4    | 0.25x            | Acceptable | Fast        | Smaller variant     |
| Q3_K_M | 3    | 0.2x             | Fair       | Very fast   | Experimental        |
| Q2_K   | 2    | 0.15x (smallest) | Poor       | Fastest     | Testing only        |

For Tesla T4 GPUs, Q4_K_M provides the best balance:

✅ Good quality - Minimal accuracy loss vs FP16
✅ Fast inference - 134 tok/s on Gemma 3-1B
✅ Small size - 4 bits per parameter
✅ Low VRAM - Fits larger models in 16 GB

Example sizes for Gemma 3-1B:

- F16: ~2.6 GB
- Q8_0: ~1.4 GB
- Q4_K_M: ~650 MB ← Recommended
- Q2_K: ~400 MB
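
Those sizes follow from simple bits-per-weight arithmetic. The sketch below uses approximate effective bits-per-weight values (real files add overhead for metadata and non-quantized tensors), so treat the results as rough estimates rather than exact file sizes.

# Rough estimate: parameters x effective bits-per-weight / 8.
def estimate_gguf_size_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1024**3

# Approximate effective bits-per-weight, shown for a 7B-parameter model.
for name, bits in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")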

Using GGUF Models with llcuda

Method 1: From HuggingFace Hub

Load directly from Unsloth or other HuggingFace repositories:

import llcuda

engine = llcuda.InferenceEngine()

# Load from Unsloth repository
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"
)

# Format: repo_id:filename

Popular Unsloth GGUF models:

- unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf
- unsloth/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-Q4_K_M.gguf
- unsloth/Qwen2.5-7B-Instruct-GGUF:Qwen2.5-7B-Instruct-Q4_K_M.gguf

Method 2: From Local File

Use a downloaded GGUF file:

engine.load_model("/path/to/model.gguf")

Method 3: From URL

Direct download from any URL:

engine.load_model(
    "https://huggingface.co/user/repo/resolve/main/model.gguf"
)

Converting Models to GGUF

From PyTorch/HuggingFace

Use the convert_hf_to_gguf.py script from llama.cpp:

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install dependencies
pip install -r requirements.txt

# Convert model
python convert_hf_to_gguf.py \
    /path/to/huggingface/model \
    --outfile model-f16.gguf \
    --outtype f16

From Unsloth Fine-Tuned Models

Export directly from Unsloth:

from unsloth import FastLanguageModel

# After fine-tuning
model.save_pretrained_gguf(
    "my_model",
    tokenizer,
    quantization_method="q4_k_m"  # Creates Q4_K_M GGUF
)

# Output: my_model/unsloth.Q4_K_M.gguf

Supported quantization methods:

- "f16" - Full precision
- "q8_0" - 8-bit quantization
- "q6_k" - 6-bit K-quant
- "q5_k_m" - 5-bit K-quant medium
- "q4_k_m" - 4-bit K-quant medium (recommended)
- "q4_k_s" - 4-bit K-quant small
- "q3_k_m" - 3-bit K-quant medium
- "q2_k" - 2-bit K-quant
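
If you want several quantization levels of the same fine-tune, one simple approach is to call the export repeatedly, reusing the model and tokenizer from the example above. A sketch (each call re-runs the conversion, so this is slow but straightforward):

# Export multiple quantization levels of the same fine-tuned model.
for method in ["q8_0", "q5_k_m", "q4_k_m"]:
    model.save_pretrained_gguf(
        "my_model",
        tokenizer,
        quantization_method=method,
    )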

Quantizing Existing GGUF

Convert between quantization levels:

# Using llama-quantize (included with llcuda binaries)
~/.cache/llcuda/bin/llama-quantize \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M

Available quantization types:

Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K
IQ1_S, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S
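
The same conversion can also be driven from Python with a subprocess call, which is convenient in notebooks. A sketch using the binary path shown above:

import subprocess
from pathlib import Path

# Re-quantize an F16 GGUF to Q4_K_M using the bundled llama-quantize binary.
quantize_bin = Path.home() / ".cache" / "llcuda" / "bin" / "llama-quantize"
subprocess.run(
    [str(quantize_bin), "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)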

GGUF Inspection Tools

Using llcuda

from llcuda.gguf_parser import GGUFReader

reader = GGUFReader("model.gguf")

print(f"Architecture: {reader.architecture}")
print(f"Quantization: {reader.quantization}")
print(f"Parameter count: {reader.parameter_count:,}")
print(f"Context length: {reader.context_length}")
print(f"Embedding size: {reader.embedding_size}")
print(f"Layers: {reader.num_layers}")
print(f"Heads: {reader.num_heads}")
print(f"File size: {reader.file_size / 1024**3:.2f} GB")

Using llama.cpp Tools

# Check GGUF metadata
~/.cache/llcuda/bin/llama-cli \
    --model model.gguf \
    --verbose
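
llama.cpp also publishes its own gguf Python package (pip install gguf), which exposes a reader similar to llcuda's. A sketch, assuming that package is installed:

from gguf import GGUFReader  # llama.cpp's gguf-py package

reader = GGUFReader("model.gguf")

# Metadata is exposed as key/value fields; tensor entries carry names and shapes.
for key in reader.fields:
    print(key)

print(f"Tensor count: {len(reader.tensors)}")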

Model Compatibility

Supported Architectures

llcuda v2.1.0 supports these model architectures via GGUF:

✅ LLaMA (LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, LLaMA-3.2)
✅ Gemma (Gemma, Gemma-2, Gemma-3)
✅ Qwen (Qwen, Qwen-2, Qwen-2.5)
✅ Mistral (Mistral, Mistral-7B)
✅ Mixtral (Mixtral 8x7B, 8x22B)
✅ Phi (Phi-2, Phi-3)
✅ Yi (Yi-6B, Yi-34B)
✅ StableLM (StableLM-2, StableLM-3)

Checking Compatibility

import llcuda

# Check if model is compatible
compat = llcuda.check_model_compatibility("model.gguf")

print(f"Compatible: {compat['compatible']}")
print(f"Architecture: {compat['architecture']}")
print(f"Warnings: {compat.get('warnings', [])}")

GGUF Best Practices

1. Choose the Right Quantization

For Tesla T4:

- Small models (1-3B): Q4_K_M or Q5_K_M
- Medium models (7-8B): Q4_K_M (fits in VRAM)
- Large models (13B+): Q4_K_M or Q3_K_M (if needed)
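
A minimal sketch of that guidance as a plain Python lookup (the thresholds are rules of thumb for a 16 GB T4, not part of the llcuda API):

# Rule-of-thumb quantization choices for a 16 GB Tesla T4.
RECOMMENDED_QUANT = {
    "1-3B": ["Q4_K_M", "Q5_K_M"],
    "7-8B": ["Q4_K_M"],
    "13B+": ["Q4_K_M", "Q3_K_M"],  # Q3_K_M only if Q4_K_M does not fit
}

def suggest_quantization(params_billion: float) -> list:
    if params_billion <= 3:
        return RECOMMENDED_QUANT["1-3B"]
    if params_billion <= 8:
        return RECOMMENDED_QUANT["7-8B"]
    return RECOMMENDED_QUANT["13B+"]

print(suggest_quantization(7))  # ['Q4_K_M']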

2. Verify GGUF Integrity

from llcuda.gguf_parser import GGUFReader

try:
    reader = GGUFReader("model.gguf")
    print("✅ Valid GGUF file")
except Exception as e:
    print(f"❌ Invalid GGUF: {e}")

3. Test Before Production

import llcuda

# Quick test
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf", silent=True)

result = engine.infer("Test prompt", max_tokens=20)
print(f"Output: {result.text}")
print(f"Speed: {result.tokens_per_sec:.1f} tok/s")

4. Optimize Storage

Use Q4_K_M for distribution:

- Smaller download size
- Faster loading
- Good quality
- Better inference speed

GGUF vs Other Formats

| Format        | Size   | Speed   | Compatibility | Ease of Use |
|---------------|--------|---------|---------------|-------------|
| GGUF          | Small  | Fast    | llama.cpp     | ✅ Easy     |
| SafeTensors   | Large  | Medium  | PyTorch       | Medium      |
| PyTorch (.pt) | Large  | Medium  | PyTorch only  | Medium      |
| ONNX          | Large  | Fast    | ONNX Runtime  | Complex     |
| TensorRT      | Custom | Fastest | NVIDIA only   | Complex     |

Why GGUF for llcuda:

- ✅ Smallest file size (with quantization)
- ✅ Fast inference on CPU and GPU
- ✅ Single-file distribution
- ✅ Works with llama.cpp ecosystem
- ✅ Easy to share and deploy

Finding GGUF Models

Unsloth HuggingFace

Most popular source for GGUF models:

https://huggingface.co/unsloth

Example repositories:

- unsloth/gemma-3-1b-it-GGUF
- unsloth/Llama-3.2-3B-Instruct-GGUF
- unsloth/Qwen2.5-7B-Instruct-GGUF
- unsloth/Meta-Llama-3.1-8B-Instruct-GGUF
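
To see which quantizations a repository actually ships before choosing a filename, you can list its files with the huggingface_hub package. A sketch, assuming that package is installed:

from huggingface_hub import list_repo_files

# List the GGUF files in a repository to pick a quantization level.
for filename in list_repo_files("unsloth/gemma-3-1b-it-GGUF"):
    if filename.endswith(".gguf"):
        print(filename)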

TheBloke (Legacy)

Older GGUF models (pre-Unsloth era):

https://huggingface.co/TheBloke

Bartowski

Recent high-quality quantizations:

https://huggingface.co/bartowski

Troubleshooting GGUF Issues

Issue: Invalid GGUF Magic Number

Error: Invalid GGUF file: wrong magic number

Solution:

- File is corrupted or incomplete
- Re-download the GGUF file
- Verify SHA256 checksum
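
A quick way to check both points: confirm the file starts with the GGUF magic bytes, then compare its SHA256 against the checksum published on the model's download page. A sketch:

import hashlib

# Check the magic bytes, then compute the SHA256 for comparison with the
# checksum listed on the model's download page.
with open("model.gguf", "rb") as f:
    print("Magic OK:", f.read(4) == b"GGUF")
    f.seek(0)
    sha256 = hashlib.sha256()
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)
print("SHA256:", sha256.hexdigest())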

Issue: Unsupported Quantization

Error: Quantization type not supported

Solution:

- Use Q4_K_M, Q5_K_M, or Q8_0
- Avoid experimental quantizations (IQ types)
- Re-quantize with llama-quantize

Issue: Model Too Large

Error: CUDA out of memory

Solution:

- Use lower quantization (Q4_K_M instead of Q8_0)
- Use a smaller model variant
- Clear GPU cache before loading

Advanced GGUF Topics

Custom Metadata

Add custom metadata to GGUF:

from llcuda.gguf_parser import GGUFWriter

writer = GGUFWriter("output.gguf")
writer.add_metadata("author", "Your Name")
writer.add_metadata("description", "Fine-tuned for specific task")
writer.add_metadata("license", "MIT")
writer.finalize()

Merging GGUF Models

Merge LoRA adapters into a base model (experimental):

# Using llama.cpp tools
llama-export-lora \
    -m base-model.gguf \
    --lora lora-adapter.gguf \
    -o merged-model.gguf

GGUF makes LLM deployment simple and efficient! 🚀