Model Selection Guide

Choose the right model and quantization for your use case with llcuda v2.1.0.

Quick Recommendations

For Tesla T4 (15 GB)

| Priority | Model        | Quantization | Speed     | VRAM   | Quality   |
|----------|--------------|--------------|-----------|--------|-----------|
| Speed    | Gemma 3-1B   | Q4_K_M       | 134 tok/s | 1.2 GB | Excellent |
| Balance  | Llama 3.2-3B | Q4_K_M       | 48 tok/s  | 2.0 GB | Very good |
| Quality  | Qwen 2.5-7B  | Q4_K_M       | 21 tok/s  | 5.0 GB | Excellent |

For Limited VRAM (< 8 GB)

| GPU VRAM | Recommended Model | Quantization | Expected Speed |
|----------|-------------------|--------------|----------------|
| 4 GB     | Gemma 3-1B        | Q4_0         | ~140 tok/s     |
| 6 GB     | Gemma 3-1B        | Q4_K_M       | ~134 tok/s     |
| 8 GB     | Llama 3.2-3B      | Q4_K_M       | ~48 tok/s      |
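
For example, the balanced pick above (Llama 3.2-3B at Q4_K_M) loads with the same pattern used throughout this guide:

import llcuda

# Balanced choice for a Tesla T4: Llama 3.2-3B at Q4_K_M (~48 tok/s, ~2.0 GB VRAM)
engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    silent=True
)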

Model Size Comparison

Performance vs Quality Trade-off

| Model Family | Size | Params | Tokens/sec (T4) | VRAM   | Best For                   |
|--------------|------|--------|-----------------|--------|----------------------------|
| Gemma 3      | 1B   | 1.2B   | 134             | 1.2 GB | Interactive apps, chatbots |
| Llama 3.2    | 3B   | 3.2B   | 48              | 2.0 GB | Balanced performance       |
| Qwen 2.5     | 7B   | 7.6B   | 21              | 5.0 GB | Quality-focused tasks      |
| Llama 3.1    | 8B   | 8.0B   | 19              | 5.5 GB | Production quality         |
| Mistral      | 7B   | 7.2B   | 22              | 5.2 GB | Code generation            |

Detailed Comparison

1B Models (Best for Speed)

Gemma 3-1B-it

  • Speed: 134 tok/s (Q4_K_M)
  • VRAM: 1.2 GB
  • Strengths:
      • Fastest inference
      • Excellent for interactive chat
      • Low VRAM requirements
      • Good quality for size
  • Weaknesses:
      • Limited reasoning on complex tasks
      • Shorter context understanding
  • Use Cases:
      • Customer service chatbots
      • Quick Q&A systems
      • Real-time code assistance
      • Mobile/edge deployment

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

3B Models (Balanced)

Llama 3.2-3B-Instruct

  • Speed: 48 tok/s (Q4_K_M)
  • VRAM: 2.0 GB
  • Strengths:
      • Good balance of speed/quality
      • Better reasoning than 1B
      • Handles complex instructions
      • Still fast enough for real-time
  • Weaknesses:
      • 3x slower than 1B models
      • Higher VRAM usage
  • Use Cases:
      • Content generation
      • Code explanation
      • Document summarization
      • Educational applications

engine.load_model(
    "unsloth/Llama-3.2-3B-Instruct-Q4_K_M-GGUF",
    silent=True
)

7B Models (Quality-Focused)

Qwen 2.5-7B-Instruct

  • Speed: 21 tok/s (Q4_K_M)
  • VRAM: 5.0 GB
  • Strengths:
      • Excellent quality
      • Strong reasoning abilities
      • Great for complex tasks
      • Multilingual support
  • Weaknesses:
      • 6x slower than 1B
      • Requires 5+ GB VRAM
  • Use Cases:
      • Research and analysis
      • Complex reasoning tasks
      • Technical documentation
      • Multi-step problem solving

engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    silent=True
)

Llama 3.1-8B-Instruct

  • Speed: 19 tok/s (Q4_K_M)
  • VRAM: 5.5 GB
  • Strengths:
      • State-of-the-art quality
      • Excellent instruction following
      • Long context support (128K)
      • Multilingual
  • Use Cases:
      • Production applications
      • API services
      • Complex workflows
      • Enterprise deployments

engine.load_model(
    "unsloth/Llama-3.1-8B-Instruct-Q4_K_M-GGUF",
    silent=True
)

Quantization Guide

Understanding Quantization Types

| Quantization | Bits | Speed     | Quality | VRAM        | File Size (vs F16) | Recommendation     |
|--------------|------|-----------|---------|-------------|--------------------|--------------------|
| Q2_K         | 2.5  | Fastest   | 85%     | Lowest      | ~30%               | Prototyping only   |
| Q3_K_M       | 3.5  | Very fast | 92%     | Very low    | ~40%               | Emergency low VRAM |
| Q4_0         | 4.0  | Fast      | 97%     | Low         | ~45%               | Speed priority     |
| Q4_K_M       | 4.5  | Fast      | 99%     | Medium      | ~50%               | Recommended        |
| Q5_K_M       | 5.5  | Moderate  | 99.5%   | Medium-high | ~60%               | Quality critical   |
| Q6_K         | 6.5  | Slow      | 99.8%   | High        | ~70%               | Rarely needed      |
| Q8_0         | 8.0  | Slower    | 99.95%  | Very high   | ~85%               | Development only   |
| F16          | 16.0 | Slowest   | 100%    | Maximum     | 100%               | Not recommended    |

Choosing Quantization

For most users:

# Q4_K_M: Best overall choice
engine.load_model(
    "model-Q4_K_M.gguf",
    silent=True
)

For speed-critical applications:

# Q4_0: 3-5% faster, slightly lower quality
engine.load_model(
    "model-Q4_0.gguf",
    silent=True
)

For quality-critical work:

# Q5_K_M: Better quality, 20% slower
engine.load_model(
    "model-Q5_K_M.gguf",
    silent=True
)

For extreme VRAM constraints:

# Q3_K_M: Smallest usable quantization
engine.load_model(
    "model-Q3_K_M.gguf",
    silent=True
)


Unsloth Models (Recommended)

Unsloth provides optimized GGUF models on HuggingFace:

Gemma Models:

# Gemma 3-1B (Best for speed)
"unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"

# Gemma 2-2B
"unsloth/gemma-2-2b-it-GGUF:gemma-2-2b-it-Q4_K_M.gguf"

# Gemma 2-9B
"unsloth/gemma-2-9b-it-GGUF:gemma-2-9b-it-Q4_K_M.gguf"

Llama Models:

# Llama 3.2-1B
"unsloth/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf"

# Llama 3.2-3B
"unsloth/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-Q4_K_M.gguf"

# Llama 3.1-8B
"unsloth/Llama-3.1-8B-Instruct-GGUF:Llama-3.1-8B-Instruct-Q4_K_M.gguf"

Mistral Models:

# Mistral 7B v0.3
"unsloth/Mistral-7B-Instruct-v0.3-GGUF:Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"

# Mistral Nemo 12B
"unsloth/Mistral-Nemo-Instruct-2407-GGUF:Mistral-Nemo-Instruct-2407-Q4_K_M.gguf"

Official HuggingFace Models

Qwen Models:

# Qwen 2.5-7B (Excellent quality)
"Qwen/Qwen2.5-7B-Instruct-GGUF:qwen2.5-7b-instruct-q4_k_m.gguf"

# Qwen 2.5-14B
"Qwen/Qwen2.5-14B-Instruct-GGUF:qwen2.5-14b-instruct-q4_k_m.gguf"

Phi Models:

# Phi 3.5-Mini (3.8B)
"microsoft/Phi-3.5-mini-instruct-gguf:Phi-3.5-mini-instruct-Q4_K_M.gguf"


VRAM Requirements

Model Size to VRAM Mapping

Approximate VRAM usage by model size and quantization:

| Model Size | Q4_K_M VRAM | Q5_K_M VRAM | Q8_0 VRAM | Extra at ctx=2048 |
|------------|-------------|-------------|-----------|-------------------|
| 1B         | 1.2 GB      | 1.5 GB      | 2.5 GB    | +0.3 GB           |
| 3B         | 2.0 GB      | 2.4 GB      | 4.2 GB    | +0.3 GB           |
| 7B         | 5.0 GB      | 6.2 GB      | 9.5 GB    | +0.5 GB           |
| 8B         | 5.5 GB      | 6.8 GB      | 10.2 GB   | +0.5 GB           |
| 13B        | 9.0 GB      | 11.0 GB     | 16.5 GB   | +0.8 GB           |
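
As a rough rule of thumb, the weight footprint is the parameter count times the bits per weight (from the quantization table) divided by 8, plus some runtime overhead and the context increment in the last column. A minimal sketch of that arithmetic; the helper and its overhead constant are illustrative assumptions, not llcuda values:

def estimate_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=0.4):
    """Rough VRAM estimate: quantized weights plus a fixed runtime allowance.

    bits_per_weight comes from the Bits column above (e.g. 4.5 for Q4_K_M,
    5.5 for Q5_K_M); overhead_gb is an assumed allowance for the CUDA
    context and scratch buffers, not a measured llcuda constant.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# 7.6B parameters at Q4_K_M -> ~4.7 GB, in the same ballpark as the 5.0 GB listed above
print(f"{estimate_vram_gb(7.6):.1f} GB")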

GPU Recommendations

| GPU      | VRAM  | Max Model (Q4_K_M) | Recommended Model          |
|----------|-------|--------------------|----------------------------|
| Tesla T4 | 15 GB | 7B                 | 1B (speed) or 7B (quality) |
| RTX 3060 | 12 GB | 7B                 | 3B                         |
| RTX 3070 | 8 GB  | 3B                 | 1B                         |
| RTX 3080 | 10 GB | 7B                 | 3B                         |
| RTX 3090 | 24 GB | 13B                | 7B                         |
| RTX 4070 | 12 GB | 7B                 | 3B                         |
| RTX 4090 | 24 GB | 13B                | 7B or 13B                  |
| A100     | 40 GB | 30B                | 13B                        |
| A100     | 80 GB | 70B                | 30B                        |
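
If you are unsure how much VRAM a machine has, you can query it at runtime and pick a size tier from the table above. This sketch reads the total via nvidia-smi; the helper functions and tier thresholds are illustrative, not part of llcuda:

import subprocess

def total_vram_gb(gpu_index=0):
    """Total VRAM in GiB for one GPU, read via nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi", f"--id={gpu_index}",
        "--query-gpu=memory.total",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip()) / 1024  # nvidia-smi reports MiB

def recommended_size(vram_gb):
    """Map available VRAM to the largest comfortable model size from the table above."""
    if vram_gb >= 24:
        return "13B"
    if vram_gb >= 10:
        return "7B"
    if vram_gb >= 6:
        return "3B"
    return "1B"

print(recommended_size(total_vram_gb()))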

Use Case Recommendations

Interactive Chatbots

Priority: Speed, low latency

Recommended:

  • Gemma 3-1B Q4_K_M (134 tok/s)
  • Llama 3.2-1B Q4_K_M (140 tok/s)

engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    ctx_size=2048,
    silent=True
)

Code Generation

Priority: Accuracy, context understanding

Recommended:

  • Qwen 2.5-7B Q4_K_M (21 tok/s)
  • Llama 3.1-8B Q4_K_M (19 tok/s)

engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    ctx_size=4096,  # Longer context for code
    silent=True
)

Document Summarization

Priority: Context length, quality

Recommended:

  • Llama 3.1-8B Q4_K_M (128K context)
  • Qwen 2.5-7B Q4_K_M

engine.load_model(
    "unsloth/Llama-3.1-8B-Instruct-Q4_K_M-GGUF",
    ctx_size=8192,  # Long documents
    silent=True
)

Question Answering

Priority: Accuracy, speed

Recommended:

  • Llama 3.2-3B Q4_K_M (48 tok/s)
  • Gemma 3-1B Q4_K_M (134 tok/s)

engine.load_model(
    "unsloth/Llama-3.2-3B-Instruct-Q4_K_M-GGUF",
    ctx_size=2048,
    silent=True
)

Content Generation

Priority: Creativity, quality

Recommended:

  • Qwen 2.5-7B Q5_K_M
  • Llama 3.1-8B Q5_K_M

engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q5_K_M",
    ctx_size=4096,
    silent=True
)

# Use creative generation settings
result = engine.infer(
    prompt,
    temperature=1.0,
    top_p=0.95,
    max_tokens=500
)

Education & Tutoring

Priority: Accuracy, explanations

Recommended:

  • Llama 3.2-3B Q4_K_M
  • Qwen 2.5-7B Q4_K_M

engine.load_model(
    "unsloth/Llama-3.2-3B-Instruct-Q4_K_M-GGUF",
    ctx_size=2048,
    silent=True
)

Model Capabilities

Multilingual Support

| Model       | Languages                                  | Notes                   |
|-------------|--------------------------------------------|-------------------------|
| Gemma 3-1B  | English primarily                          | Limited multilingual    |
| Llama 3.2-3B| 8 languages                                | Good multilingual       |
| Llama 3.1-8B| 8 languages                                | Excellent multilingual  |
| Qwen 2.5-7B | 29 languages                               | Best multilingual       |
| Mistral 7B  | English, French, German, Spanish, Italian  | Good European languages |
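
For multilingual work, Qwen 2.5-7B is the strongest option in this lineup, and non-English prompts go through the same infer API shown elsewhere in this guide. A minimal example (the Spanish prompt is just an illustration):

engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    silent=True
)

# Spanish prompt; instruct models typically answer in the prompt's language
result = engine.infer(
    "Explica brevemente qué es la computación cuántica.",
    max_tokens=200
)
print(result.text)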

Context Window Support

| Model        | Standard Context | Max Context | Notes                  |
|--------------|------------------|-------------|------------------------|
| Gemma 3-1B   | 2K               | 8K          | Limited long context   |
| Llama 3.2-3B | 4K               | 128K        | Excellent long context |
| Llama 3.1-8B | 8K               | 128K        | Best long context      |
| Qwen 2.5-7B  | 8K               | 32K         | Good long context      |
| Mistral 7B   | 8K               | 32K         | Good long context      |
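
Longer contexts are requested at load time via ctx_size, as in the earlier examples; a sketch for a long-context load (request only what you need, since KV-cache VRAM grows with the context length):

# Llama 3.1-8B supports up to 128K context; 32K is an example value
engine.load_model(
    "unsloth/Llama-3.1-8B-Instruct-GGUF:Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    ctx_size=32768,
    silent=True
)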

Special Capabilities

| Model        | Code      | Math      | Reasoning | Function Calling |
|--------------|-----------|-----------|-----------|------------------|
| Gemma 3-1B   | Good      | Fair      | Fair      | No               |
| Llama 3.2-3B | Very Good | Good      | Good      | Yes              |
| Llama 3.1-8B | Excellent | Very Good | Excellent | Yes              |
| Qwen 2.5-7B  | Excellent | Excellent | Excellent | Yes              |
| Mistral 7B   | Very Good | Good      | Good      | Yes              |

Finding and Loading Models

From Unsloth

# Browse models at: https://huggingface.co/unsloth
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

From Official Repos

# Qwen
engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    silent=True
)

# Microsoft Phi
engine.load_model(
    "microsoft/Phi-3.5-mini-instruct-gguf:Phi-3.5-mini-instruct-Q4_K_M.gguf",
    silent=True
)

Local Models

# Load from local path
engine.load_model(
    "/path/to/model.gguf",
    silent=True
)

Model Evaluation

Quick Quality Test

import llcuda

def evaluate_model(model_path):
    """Quick quality evaluation."""

    engine = llcuda.InferenceEngine()
    engine.load_model(model_path, silent=True)

    test_prompts = [
        "Explain quantum computing in simple terms.",
        "Write a Python function to calculate factorial.",
        "What are the causes of climate change?",
        "Translate 'Hello, how are you?' to Spanish.",
        "Solve: If x + 5 = 12, what is x?"
    ]

    print(f"\n{'='*60}")
    print(f"Evaluating: {model_path}")
    print(f"{'='*60}\n")

    for i, prompt in enumerate(test_prompts, 1):
        result = engine.infer(prompt, max_tokens=150)

        print(f"{i}. {prompt}")
        print(f"   Response: {result.text[:100]}...")
        print(f"   Speed: {result.tokens_per_sec:.1f} tok/s\n")

    metrics = engine.get_metrics()
    print(f"Average speed: {metrics['throughput']['tokens_per_sec']:.1f} tok/s")
    print(f"Average latency: {metrics['latency']['mean_ms']:.0f}ms")

# Test multiple models
models = [
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    "unsloth/Llama-3.2-3B-Instruct-Q4_K_M-GGUF",
]

for model in models:
    evaluate_model(model)

Migration Guide

From Larger to Smaller Models

If you need to reduce VRAM:

# Before: 7B model (5 GB VRAM)
engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    silent=True
)

# After: 3B model (2 GB VRAM)
engine.load_model(
    "unsloth/Llama-3.2-3B-Instruct-Q4_K_M-GGUF",
    silent=True
)

From Higher to Lower Quantization

# Before: Q5_K_M (better quality, slower)
engine.load_model("model-Q5_K_M.gguf", silent=True)

# After: Q4_K_M (faster, minimal quality loss)
engine.load_model("model-Q4_K_M.gguf", silent=True)

See Also