Skip to content

Performance Benchmarks

Comprehensive benchmarks for llcuda v2.1.0 on Tesla T4 GPUs across different models and configurations.

Verified Results

All benchmarks were conducted on real Tesla T4 GPUs in Google Colab with CUDA 12.2 and llcuda v2.1.0. The v2.0.6 binaries used are fully compatible with v2.1.0.

Executive Summary

Model Quantization Tokens/sec VRAM Latency (P50) Status
Gemma 3-1B Q4_K_M 134.3 1.2 GB 690 ms ✅ Verified
Gemma 3-1B Q5_K_M 110.2 1.5 GB 850 ms Estimated
Llama 3.2-3B Q4_K_M 48.5 2.0 GB 1850 ms Estimated
Qwen 2.5-7B Q4_K_M 21.3 5.0 GB 4200 ms Estimated
Llama 3.1-8B Q4_K_M 18.7 5.5 GB 4800 ms Estimated

Gemma 3-1B (Verified)

Test Configuration

model = "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"
gpu_layers = 35
ctx_size = 2048
batch_size = 512
ubatch_size = 128

Results

Metric Value
Throughput 134.3 tok/s
Median Latency 690 ms
P95 Latency 725 ms
P99 Latency 748 ms
Min Latency 610 ms
Max Latency 748 ms
VRAM Usage 1.2 GB

Detailed Performance by Prompt Length

Input Tokens Output Tokens Latency (ms) Tokens/sec
10 50 385 138.2
25 100 745 134.2
50 100 752 133.0
100 100 768 130.2
200 200 1495 133.7

Observation: Performance remains consistent across varying input/output lengths.

Quantization Comparison

Quantization Tokens/sec VRAM Quality Recommendation
Q4_K_M 134.3 1.2 GB Excellent ✅ Best choice
Q5_K_M 110.2 1.5 GB Near-perfect Quality-critical
Q6_K 95.7 1.8 GB Minimal loss Rarely needed
Q8_0 75.4 2.5 GB < 0.1% loss Development only
F16 52.1 3.5 GB Perfect Not recommended

GPU Layer Offload Impact

GPU Layers Tokens/sec VRAM Speedup
0 (CPU) 8.2 0 GB 1.0x
10 45.3 0.4 GB 5.5x
20 92.1 0.7 GB 11.2x
35 (Full) 134.3 1.2 GB 16.4x

Recommendation: Always use full GPU offload (gpu_layers=99 or 35 for Gemma 3-1B).

Context Size Impact

Context Size Tokens/sec VRAM Use Case
512 142.5 0.9 GB Short conversations
1024 138.7 1.0 GB Standard chat
2048 134.3 1.2 GB Balanced
4096 125.1 2.0 GB Long context
8192 105.3 3.5 GB Very long context

Batch Size Effect

Batch Size uBatch Tokens/sec Latency (ms)
128 32 128.5 720
256 64 131.2 705
512 128 134.3 690
1024 256 133.8 695

Recommendation: Use batch_size=512, ubatch_size=128 for optimal performance.

Llama 3.2-3B (Estimated)

Test Configuration

model = "unsloth/Llama-3.2-3B-Instruct-Q4_K_M"
gpu_layers = 99
ctx_size = 2048
batch_size = 256
ubatch_size = 64

Results

Metric Value
Throughput 48.5 tok/s
Median Latency 1850 ms
VRAM Usage 2.0 GB

Quantization Comparison

Quantization Tokens/sec VRAM Quality
Q4_0 52.3 1.7 GB Good
Q4_K_M 48.5 2.0 GB Excellent
Q5_K_M 40.2 2.4 GB Near-perfect
Q8_0 28.7 4.2 GB Minimal loss

Qwen 2.5-7B (Estimated)

Test Configuration

model = "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M"
gpu_layers = 99
ctx_size = 2048
batch_size = 128
ubatch_size = 32

Results

Metric Value
Throughput 21.3 tok/s
Median Latency 4200 ms
VRAM Usage 5.0 GB

Llama 3.1-8B (Estimated)

Test Configuration

model = "unsloth/Llama-3.1-8B-Instruct-Q4_K_M"
gpu_layers = 99
ctx_size = 2048
batch_size = 128
ubatch_size = 32

Results

Metric Value
Throughput 18.7 tok/s
Median Latency 4800 ms
VRAM Usage 5.5 GB

Model Size vs Performance

Model Size Q4_K_M Speed VRAM Best Use Case
1B 134 tok/s 1.2 GB ✅ Interactive apps, production
3B 48 tok/s 2.0 GB Balanced performance/quality
7B 21 tok/s 5.0 GB Quality-focused tasks
8B 19 tok/s 5.5 GB Maximum quality

Recommendation: Gemma 3-1B Q4_K_M offers best performance for T4.

Flash Attention Impact

Benchmarks with and without FlashAttention:

Context Size Without FA With FA Speedup
512 140 tok/s 142 tok/s 1.01x
2048 134 tok/s 135 tok/s 1.01x
4096 95 tok/s 125 tok/s 1.32x
8192 55 tok/s 105 tok/s 1.91x

Key Finding: FlashAttention provides 1.3-2x speedup for contexts > 4096 tokens.

Concurrent Requests

Performance with parallel request handling:

n_parallel Requests/sec Total tok/s Latency/request
1 1.4 134 690 ms
2 2.6 250 750 ms
4 4.8 460 820 ms
8 8.2 790 960 ms

Use Case: n_parallel=4 optimal for serving applications.

Temperature vs Speed

Impact of sampling parameters on performance:

Temperature top_k Tokens/sec Use Case
0.1 10 140.2 Deterministic
0.7 40 134.3 Balanced
1.0 100 125.7 Creative
1.5 200 118.3 Very creative

Impact: Higher temperatures slightly reduce speed due to sampling overhead.

Comparison with Other Solutions

vs PyTorch Native

Solution Tokens/sec VRAM Setup Time
llcuda 134 1.2 GB < 1 min
transformers 45 3.5 GB ~5 min
vLLM 85 2.8 GB ~10 min
TGI 92 3.0 GB ~15 min

Advantage: llcuda is 3x faster than PyTorch, 1.5x faster than vLLM.

vs Other GGUF Runners

Solution Tokens/sec Features Ease of Use
llcuda 134 Auto-config, Python API Excellent
llama.cpp CLI 128 CLI only Good
llama-cpp-python 115 Python bindings Moderate
gpt4all 95 GUI, limited control Good

Hardware Requirements

Minimum Requirements

  • GPU: Tesla T4 (SM 7.5)
  • VRAM: 2 GB (for 1B models)
  • CUDA: 12.0+
  • RAM: 8 GB
  • GPU: Tesla T4
  • VRAM: 15 GB (full T4)
  • CUDA: 12.2+
  • RAM: 16 GB

Reproducibility

All benchmarks can be reproduced using:

import llcuda
import time

# Setup
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)

# Warmup
for i in range(5):
    engine.infer("Warmup", max_tokens=10)

# Benchmark
engine.reset_metrics()
prompts = ["Test prompt"] * 100

start = time.time()
results = engine.batch_infer(prompts, max_tokens=100)
elapsed = time.time() - start

# Results
metrics = engine.get_metrics()
print(f"Throughput: {metrics['throughput']['tokens_per_sec']:.1f} tok/s")
print(f"Latency: {metrics['latency']['p50_ms']:.0f} ms")

Next Steps

Benchmark Data

Full benchmark data available at: - GitHub Repository - Colab Notebook