Performance Optimization Tutorial¶
Learn how to optimize llcuda for maximum throughput and minimum latency on Tesla T4 GPUs.
Quick Win
For immediate performance gains, use Q4_K_M quantization with full GPU offload (gpu_layers=99). This achieves 130+ tok/s on Gemma 3-1B.
Performance Overview¶
llcuda v2.1.0 achieves exceptional performance on Tesla T4:
- Gemma 3-1B: 134 tok/s (verified)
- Latency: < 700ms median
- Memory: 1.2 GB for 1B models
- Throughput: Consistent across batch sizes
Key Performance Factors¶
1. Quantization Method¶
Choose the right quantization for your use case:
| Quantization | Performance | Memory | Quality | Use case |
|---|---|---|---|---|
| Q4_K_M (recommended) | 134 tok/s (Gemma 3-1B) | 1.2 GB | Excellent (< 1% degradation) | Production inference |
| Q5_K_M | ~110 tok/s | 1.5 GB | Near-perfect | Quality-critical applications |
Recommendation: Use Q4_K_M for best performance/quality balance.
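If you want to verify this trade-off on your own hardware, you can benchmark the variants directly. The sketch below assumes both GGUF files have already been downloaded; the filenames are illustrative placeholders.

```python
import llcuda

engine = llcuda.InferenceEngine()

# NOTE: placeholder filenames -- point these at your own local GGUF files.
for model_path in ["gemma-3-1b-Q4_K_M.gguf", "gemma-3-1b-Q5_K_M.gguf"]:
    engine.load_model(model_path, gpu_layers=99, auto_start=True, verbose=False)
    result = engine.infer("Test", max_tokens=50)
    print(f"{model_path}: {result.tokens_per_sec:.1f} tok/s")
    engine.unload_model()
```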
2. GPU Layer Offloading¶
Control how many layers run on GPU:
# Full GPU offload (fastest)
engine.load_model("model.gguf", gpu_layers=99) # 134 tok/s
# Partial offload (if VRAM limited)
engine.load_model("model.gguf", gpu_layers=20) # ~80 tok/s
# CPU only (very slow)
engine.load_model("model.gguf", gpu_layers=0) # ~8 tok/s
Rule of thumb: Always use gpu_layers=99 unless you have VRAM constraints.
3. Context Window Size¶
Balance between functionality and speed:
# Small context (fastest)
engine.load_model("model.gguf", ctx_size=1024) # +10% speed
# Medium context (balanced)
engine.load_model("model.gguf", ctx_size=2048) # Baseline
# Large context (slower)
engine.load_model("model.gguf", ctx_size=8192) # -20% speed
Memory impact:
- 1024 ctx: +0.5 GB
- 2048 ctx: +1.0 GB
- 4096 ctx: +2.0 GB
- 8192 ctx: +4.0 GB
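As a rough back-of-the-envelope check (an illustrative approximation based on the figures above, not an llcuda API), you can estimate whether a configuration fits in the T4's 16 GB of VRAM:

```python
def estimate_vram_gb(model_gb: float, ctx_size: int) -> float:
    """Rough estimate: model weights + ~0.5 GB of KV cache per 1024 tokens of context (1B-class model)."""
    return model_gb + 0.5 * (ctx_size / 1024)

# Gemma 3-1B Q4_K_M (~1.2 GB) with an 8192-token context:
print(f"{estimate_vram_gb(1.2, 8192):.1f} GB")  # ~5.2 GB, well within the T4's 16 GB
```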
4. Batch Processing¶
Tune the batch parameters for higher throughput:
# Configure batch parameters
engine.load_model(
    "model.gguf",
    batch_size=512,   # Logical batch size
    ubatch_size=128,  # Physical batch size
    gpu_layers=99
)
Batch size guidelines:
| Model Size | batch_size | ubatch_size | Throughput |
|---|---|---|---|
| 1B params | 512 | 128 | 134 tok/s |
| 3B params | 256 | 64 | ~100 tok/s |
| 7B params | 128 | 32 | ~50 tok/s |
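To pick these values programmatically, a simple lookup over the table above could look like this (the helper and its thresholds are illustrative, not part of llcuda):

```python
import llcuda

def suggest_batch_config(model_params_b: float) -> tuple[int, int]:
    """Return (batch_size, ubatch_size) following the guideline table above."""
    if model_params_b <= 1:
        return 512, 128
    if model_params_b <= 3:
        return 256, 64
    return 128, 32

engine = llcuda.InferenceEngine()
batch_size, ubatch_size = suggest_batch_config(1.0)  # 1B model -> (512, 128)
engine.load_model(
    "model.gguf",
    batch_size=batch_size,
    ubatch_size=ubatch_size,
    gpu_layers=99
)
```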
5. Flash Attention¶
llcuda v2.1.0 includes FlashAttention by default:
# FlashAttention is automatically enabled for:
# - Compute capability 7.5+ (T4, RTX 20xx+)
# - Context sizes > 2048
# - All quantization types
# Benefit: 2-3x faster for long contexts
Performance with FlashAttention:
| Context Size | Without FA | With FA | Speedup |
|---|---|---|---|
| 512 | 140 tok/s | 142 tok/s | 1.01x |
| 2048 | 134 tok/s | 135 tok/s | 1.01x |
| 4096 | 95 tok/s | 125 tok/s | 1.32x |
| 8192 | 55 tok/s | 105 tok/s | 1.91x |
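To confirm your GPU meets the compute-capability requirement, you can query it with nvidia-smi (the compute_cap field is available on recent drivers; this check is independent of llcuda):

```python
import subprocess

# Query GPU name and compute capability; Tesla T4 reports 7.5, which enables FlashAttention.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True,
    text=True
)
print(out.stdout.strip())  # e.g. "Tesla T4, 7.5"
```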
Optimization Workflow¶
Step 1: Baseline Measurement¶
import llcuda
engine = llcuda.InferenceEngine()
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    auto_start=True,
    verbose=True
)
# Run baseline test
prompts = ["Test prompt"] * 10
results = engine.batch_infer(prompts, max_tokens=100)
# Check metrics
metrics = engine.get_metrics()
print(f"Baseline speed: {metrics['throughput']['tokens_per_sec']:.1f} tok/s")
print(f"Baseline latency: {metrics['latency']['mean_ms']:.0f} ms")
Step 2: Optimize GPU Offload¶
# Test different GPU layer counts
for gpu_layers in [10, 20, 35, 99]:
    engine.unload_model()
    engine.load_model(
        "model.gguf",
        gpu_layers=gpu_layers,
        auto_start=True,
        verbose=False
    )
    result = engine.infer("Test", max_tokens=50)
    print(f"gpu_layers={gpu_layers}: {result.tokens_per_sec:.1f} tok/s")
# Expected output:
# gpu_layers=10: 65.2 tok/s
# gpu_layers=20: 92.1 tok/s
# gpu_layers=35: 127.5 tok/s
# gpu_layers=99: 134.2 tok/s ← Best
Step 3: Optimize Context Size¶
# Test different context sizes
for ctx_size in [512, 1024, 2048, 4096]:
    engine.unload_model()
    engine.load_model(
        "model.gguf",
        ctx_size=ctx_size,
        gpu_layers=99,
        auto_start=True,
        verbose=False
    )
    result = engine.infer("Test", max_tokens=50)
    print(f"ctx_size={ctx_size}: {result.tokens_per_sec:.1f} tok/s")
# Choose smallest ctx_size that meets your needs
Step 4: Optimize Batch Parameters¶
# Test batch configurations
configs = [
    (256, 64),    # Small
    (512, 128),   # Medium (default)
    (1024, 256),  # Large
]

for batch_size, ubatch_size in configs:
    engine.unload_model()
    engine.load_model(
        "model.gguf",
        batch_size=batch_size,
        ubatch_size=ubatch_size,
        gpu_layers=99,
        auto_start=True,
        verbose=False
    )
    result = engine.infer("Test", max_tokens=50)
    print(f"batch={batch_size}, ubatch={ubatch_size}: {result.tokens_per_sec:.1f} tok/s")
Advanced Optimizations¶
Parallel Sequences¶
Process multiple sequences in parallel:
engine.load_model(
    "model.gguf",
    n_parallel=4,  # Process 4 sequences simultaneously
    gpu_layers=99,
    auto_start=True
)

# Submit multiple requests
import concurrent.futures

def infer_async(prompt):
    return engine.infer(prompt, max_tokens=50)

prompts = ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(infer_async, prompts))
# Total throughput: ~500 tok/s with n_parallel=4
Continuous Batching¶
For serving applications:
# Enable continuous batching
engine.load_model(
    "model.gguf",
    n_parallel=8,
    batch_size=512,
    ubatch_size=128,
    gpu_layers=99,
    auto_start=True
)
# Handles variable-length sequences efficiently
# Throughput increases with concurrent requests
Temperature Tuning¶
Balance quality and speed:
# Faster (less sampling)
result = engine.infer(
    "Prompt",
    temperature=0.1,  # Greedy-like
    top_k=10,         # Limit sampling
    max_tokens=100
)
# Speed: ~140 tok/s

# Balanced
result = engine.infer(
    "Prompt",
    temperature=0.7,  # Default
    top_k=40,
    max_tokens=100
)
# Speed: ~134 tok/s

# Creative (more sampling)
result = engine.infer(
    "Prompt",
    temperature=1.0,
    top_k=100,
    max_tokens=100
)
# Speed: ~125 tok/s
Memory Optimization¶
Model Caching¶
Cache models to avoid reloading:
# Keep model in memory between sessions
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf", auto_start=True)
# Reuse engine for multiple inferences
for i in range(1000):
    result = engine.infer(f"Prompt {i}", max_tokens=50)
# Don't unload until done
engine.unload_model()
KV Cache Management¶
Control key-value cache:
# Allocate more VRAM for KV cache
engine.load_model(
    "model.gguf",
    ctx_size=4096,    # Context window
    cache_size=None,  # Auto-calculate
    gpu_layers=99
)

# Manual cache control (advanced)
engine.load_model(
    "model.gguf",
    ctx_size=4096,
    cache_size=8192,  # 2x context for better caching
    gpu_layers=99
)
Profiling and Monitoring¶
Built-in Metrics¶
# Get detailed metrics
metrics = engine.get_metrics()
print("Latency Stats:")
print(f" Mean: {metrics['latency']['mean_ms']:.0f} ms")
print(f" P50: {metrics['latency']['p50_ms']:.0f} ms")
print(f" P95: {metrics['latency']['p95_ms']:.0f} ms")
print(f" P99: {metrics['latency']['p99_ms']:.0f} ms")
print("\nThroughput Stats:")
print(f" Tokens/sec: {metrics['throughput']['tokens_per_sec']:.1f}")
print(f" Requests/sec: {metrics['throughput']['requests_per_sec']:.2f}")
print(f" Total tokens: {metrics['throughput']['total_tokens']}")
GPU Monitoring¶
import subprocess
def monitor_gpu():
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv,noheader"],
        capture_output=True,
        text=True
    )
    print(f"GPU: {result.stdout.strip()}")
# Monitor during inference
monitor_gpu()
result = engine.infer("Long prompt...", max_tokens=200)
monitor_gpu()
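To sample utilization while a request is actually running (rather than only before and after), one option is to poll the monitor_gpu() helper above from a background thread; this is an illustrative sketch, not an llcuda feature:

```python
import threading
import time

def monitor_gpu_during(stop_event, interval_s=0.5):
    # Poll nvidia-smi until the main thread signals completion.
    while not stop_event.is_set():
        monitor_gpu()
        time.sleep(interval_s)

stop = threading.Event()
sampler = threading.Thread(target=monitor_gpu_during, args=(stop,), daemon=True)
sampler.start()
result = engine.infer("Long prompt...", max_tokens=200)
stop.set()
sampler.join()
```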
Performance Checklist¶
Use this checklist to ensure optimal performance:
- Quantization: Using Q4_K_M or Q5_K_M
- GPU Offload: gpu_layers=99 (full offload)
- Context Size: Smallest that meets requirements
- Batch Size: 512/128 for 1B models
- FlashAttention: Enabled (automatic on T4)
- CUDA Version: 12.0+
- Driver: Latest NVIDIA driver
- Model Choice: Appropriate size for T4 (1B-3B)
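A few of the hardware-level items (GPU model, driver version, available VRAM) can be checked directly from a notebook; the snippet below is an illustrative nvidia-smi check, not an llcuda API. The CUDA version appears in the header of plain nvidia-smi output.

```python
import subprocess

# GPU model, driver version, and total VRAM on one line,
# e.g. "Tesla T4, <driver version>, 15360 MiB"
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"],
    capture_output=True,
    text=True
)
print(out.stdout.strip())
```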
Common Performance Issues¶
Issue: Slow Inference (<50 tok/s)¶
Diagnosis:
metrics = engine.get_metrics()
print(f"Speed: {metrics['throughput']['tokens_per_sec']:.1f} tok/s")
# Check GPU usage
!nvidia-smi
Solutions:
1. Increase GPU layers: gpu_layers=99
2. Use Q4_K_M quantization
3. Reduce context size: ctx_size=2048
4. Check GPU utilization (should be >80%)
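Applying the first three fixes usually comes down to a single reload (a sketch reusing the parameters shown earlier in this tutorial):

```python
engine.unload_model()
engine.load_model(
    "gemma-3-1b-Q4_K_M.gguf",  # Q4_K_M quantization
    gpu_layers=99,             # Full GPU offload
    ctx_size=2048,             # Moderate context window
    auto_start=True
)
result = engine.infer("Test", max_tokens=50)
print(f"{result.tokens_per_sec:.1f} tok/s")
```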
Issue: High Latency (>2000ms)¶
Diagnosis:
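A quick check using the built-in metrics (same fields as in the Built-in Metrics section above):

```python
metrics = engine.get_metrics()
print(f"P95 latency: {metrics['latency']['p95_ms']:.0f} ms")
print(f"P99 latency: {metrics['latency']['p99_ms']:.0f} ms")
```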
Solutions:
1. Reduce max_tokens
2. Use a smaller context size
3. Check for a CPU bottleneck
4. Verify inference is running on the T4 GPU (not CPU-only)
Issue: Out of Memory¶
Diagnosis:
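Check how much VRAM is in use versus the card's total (an nvidia-smi check, independent of llcuda):

```python
import subprocess

# Used vs. total VRAM, e.g. "14500 MiB, 15360 MiB" on a nearly full T4
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True,
    text=True
)
print(out.stdout.strip())
```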
Solutions:
# Reduce GPU layers
gpu_layers = 20 # Instead of 99
# Reduce context
ctx_size = 1024 # Instead of 4096
# Reduce batch size
batch_size = 256 # Instead of 512
Best Configurations¶
Configuration 1: Maximum Speed¶
engine.load_model(
    "gemma-3-1b-Q4_K_M.gguf",
    gpu_layers=99,
    ctx_size=1024,
    batch_size=512,
    ubatch_size=128,
    n_parallel=1,
    auto_start=True
)
# Expected: 140+ tok/s
Configuration 2: Balanced¶
engine.load_model(
    "gemma-3-1b-Q4_K_M.gguf",
    gpu_layers=99,
    ctx_size=2048,
    batch_size=512,
    ubatch_size=128,
    n_parallel=1,
    auto_start=True
)
# Expected: 134 tok/s (default)
Configuration 3: Long Context¶
engine.load_model(
    "gemma-3-1b-Q4_K_M.gguf",
    gpu_layers=99,
    ctx_size=8192,
    batch_size=256,
    ubatch_size=64,
    n_parallel=1,
    auto_start=True
)
# Expected: 105 tok/s with FlashAttention
Configuration 4: Multi-Request¶
engine.load_model(
    "gemma-3-1b-Q4_K_M.gguf",
    gpu_layers=99,
    ctx_size=2048,
    batch_size=1024,
    ubatch_size=256,
    n_parallel=8,
    auto_start=True
)
# Expected: 400+ tok/s total throughput
Next Steps¶
- Benchmarks - Compare model performance
- T4 Results - Detailed T4 benchmarks
- Optimization Guide - Advanced tuning
- Troubleshooting - Fix issues
Performance Achieved
Following these optimizations, you should achieve 130+ tok/s on Gemma 3-1B with Tesla T4, matching our verified benchmarks.