Performance Optimization¶
Advanced techniques to maximize inference performance with llcuda v2.1.0.
Overview¶
This guide covers optimization strategies for:
- Quantization selection
- Context length tuning
- Batch size optimization
- Server configuration
- Memory optimization
- Multi-GPU setups
Quantization Selection¶
Understanding Quantization¶
Quantization reduces model precision to improve speed and reduce VRAM usage.
| Quantization | Bits/Weight | Speed | Quality | VRAM | Recommendation |
|---|---|---|---|---|---|
| Q2_K | 2.5 | Fastest | Poor | Lowest | Prototyping only |
| Q3_K_M | 3.5 | Very fast | Fair | Low | Low VRAM only |
| Q4_0 | 4.0 | Fast | Good | Medium | Speed priority |
| Q4_K_M | 4.5 | Fast | Excellent | Medium | ✅ Best balance |
| Q5_K_M | 5.5 | Moderate | Near-perfect | High | Quality priority |
| Q6_K | 6.5 | Slow | Minimal loss | Higher | Rarely needed |
| Q8_0 | 8.0 | Slower | Negligible loss | Highest | Development only |
| F16 | 16.0 | Slowest | Perfect | Maximum | Not recommended |
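A useful sanity check: the bits-per-weight column roughly predicts GGUF file size (parameters × bits per weight ÷ 8). A quick sketch; real files add metadata and keep a few tensors at higher precision:
# Rough file-size estimate from bits/weight (approximate; real GGUF
# files add metadata and keep some tensors at higher precision)
def estimate_file_size_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8
print(estimate_file_size_gb(1.0, 4.5))  # Q4_K_M, 1B params: ~0.56 GB
print(estimate_file_size_gb(1.0, 8.0))  # Q8_0,   1B params: ~1.0 GB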
Choosing the Right Quantization¶
For Tesla T4 (15 GB VRAM):
import llcuda
# Q4_K_M: Best overall choice
# - 134 tok/s
# - 1.2 GB VRAM
# - 99% quality
engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)
For limited VRAM (< 8 GB):
# Q4_0: Faster, less VRAM
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_0.gguf",
    silent=True
)
For quality-critical applications:
# Q5_K_M: Better quality, slower
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q5_K_M.gguf",
    silent=True
)
Quantization Performance Comparison¶
On Tesla T4, Gemma 3-1B:
| Quantization | Tokens/sec | Speedup | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | 134.3 | 1.00x | 99% | General use ✅ |
| Q4_0 | 138.7 | 1.03x | 97% | Speed-critical |
| Q5_K_M | 110.2 | 0.82x | 99.5% | High quality |
| Q3_K_M | 142.3 | 1.06x | 92% | Low VRAM |
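To reproduce this table on your own hardware, loop over the quantized files and read the engine metrics. A minimal sketch; the filenames are placeholders:
# Compare quantizations on your own GPU (filenames are placeholders)
import llcuda
quants = ["model-Q4_K_M.gguf", "model-Q4_0.gguf", "model-Q5_K_M.gguf"]
for path in quants:
    engine = llcuda.InferenceEngine()
    engine.load_model(path, silent=True)
    for _ in range(3):  # warmup
        engine.infer("Warmup", max_tokens=10)
    engine.reset_metrics()
    for _ in range(10):
        engine.infer("Test prompt", max_tokens=100)
    tok_s = engine.get_metrics()['throughput']['tokens_per_sec']
    print(f"{path}: {tok_s:.1f} tok/s")
    engine.unload_model()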
Context Length Optimization¶
Impact of Context Size¶
| Context Size | Use Case | Tokens/sec | VRAM | Latency/100tok |
|---|---|---|---|---|
| 512 | Quick Q&A | 142.5 | 0.9 GB | 702 ms |
| 1024 | Short chat | 138.7 | 1.0 GB | 721 ms |
| 2048 | Standard | 134.3 | 1.2 GB | 745 ms |
| 4096 | Long chat | 125.1 | 2.0 GB | 799 ms |
| 8192 | Documents | 105.3 | 3.5 GB | 950 ms |
Rule of thumb: Use the smallest context size that fits your use case.
Optimal Context Sizes¶
For interactive chat:
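# A sketch using the standard value from the table above
engine.load_model(
    model_path,
    ctx_size=2048,
    silent=True
)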
For document Q&A:
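# A sketch; pair long contexts with FlashAttention (see below)
engine.load_model(
    model_path,
    ctx_size=8192,
    flash_attn=True,
    silent=True
)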
For quick responses:
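# A sketch using the smallest context from the table above
engine.load_model(
    model_path,
    ctx_size=512,
    silent=True
)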
Memory vs Context Trade-off¶
VRAM usage grows with context size as the KV cache expands; a rough empirical estimate:
# Calculate VRAM for context
def estimate_vram_gb(model_size_gb, ctx_size):
    base_vram = model_size_gb
    kv_cache_gb = (ctx_size / 2048) ** 1.2 * 0.3  # Approximate
    return base_vram + kv_cache_gb
# Example: Gemma 3-1B Q4_K_M
print(estimate_vram_gb(1.2, 2048))  # ~1.5 GB
print(estimate_vram_gb(1.2, 8192))  # ~2.8 GB
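Inverting the same approximation gives the largest context that fits a VRAM budget; this sketch inherits the accuracy of the estimate above:
# Invert the approximation above: max context for a VRAM budget
def max_ctx_for_vram(model_size_gb, vram_budget_gb):
    kv_budget_gb = vram_budget_gb - model_size_gb
    if kv_budget_gb <= 0:
        return 0
    return int(2048 * (kv_budget_gb / 0.3) ** (1 / 1.2))
print(max_ctx_for_vram(1.2, 2.0))  # ~4600 tokens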
Batch Size Tuning¶
Understanding Batch Parameters¶
- batch_size: Maximum tokens processed in parallel (prompt processing)
- ubatch_size: Micro-batch size for generation (critical for low VRAM)
Optimal Settings for Tesla T4¶
engine.load_model(
    model_path,
    batch_size=512,   # Optimal for T4
    ubatch_size=128,  # Good balance
    silent=True
)
Batch Size Recommendations¶
| VRAM | Model Size | batch_size | ubatch_size | Performance |
|---|---|---|---|---|
| 4 GB | 1B | 256 | 64 | Good |
| 8 GB | 1-3B | 512 | 128 | Optimal |
| 15 GB | 1-7B | 512 | 128 | Optimal |
| 24 GB | 1-13B | 1024 | 256 | Excellent |
| 40+ GB | Any | 2048 | 512 | Maximum |
Impact Analysis¶
On Tesla T4, Gemma 3-1B Q4_K_M:
| batch_size | ubatch_size | Tokens/sec | VRAM | Notes |
|---|---|---|---|---|
| 128 | 32 | 128.5 | 1.12 GB | Too small |
| 256 | 64 | 131.2 | 1.15 GB | Suboptimal |
| 512 | 128 | 134.3 | 1.18 GB | Optimal |
| 1024 | 256 | 133.8 | 1.25 GB | Diminishing returns |
| 2048 | 512 | 132.1 | 1.42 GB | Slower |
GPU Layer Offload¶
Full vs Partial Offload¶
Always use full GPU offload when possible:
# RECOMMENDED: Full GPU offload
engine.load_model(
    model_path,
    gpu_layers=99,  # Offload all layers
    silent=True
)
Partial Offload (Limited VRAM)¶
If VRAM is insufficient:
from llcuda.utils import get_recommended_gpu_layers
# Calculate optimal layers
gpu_layers = get_recommended_gpu_layers(
    model_size_gb=1.2,
    vram_gb=4.0
)
engine.load_model(
    model_path,
    gpu_layers=gpu_layers,  # Partial offload
    silent=True
)
Layer Offload Performance¶
Tesla T4, Gemma 3-1B:
| GPU Layers | Tokens/sec | VRAM | Speedup |
|---|---|---|---|
| 0 | 8.2 | 0 GB | 1.0x |
| 10 | 45.3 | 0.4 GB | 5.5x |
| 20 | 92.1 | 0.9 GB | 11.2x |
| 35 (full) | 134.3 | 1.2 GB | 16.4x |
Rule of thumb: on this setup, each offloaded layer adds roughly 3.6 tok/s and ~34 MB of VRAM.
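That rule is just a linear fit to the table above; for planning a partial offload you can turn it into a rough estimator. The constants come from this specific T4 + Gemma 3-1B run, so treat the output as a ballpark:
# Linear fit to the table above (T4 + Gemma 3-1B constants only)
def estimate_offload(gpu_layers, total_layers=35):
    tok_s = 8.2 + (134.3 - 8.2) * gpu_layers / total_layers
    vram_gb = 1.2 * gpu_layers / total_layers
    return tok_s, vram_gb
print(estimate_offload(20))  # ~(80.3, 0.69); measured: (92.1, 0.9)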
Server Configuration¶
Optimal Server Settings¶
engine.load_model(
    model_path,
    # Core settings
    gpu_layers=99,
    ctx_size=2048,
    batch_size=512,
    ubatch_size=128,
    # Server optimization
    n_parallel=1,     # Single request (interactive)
    threads=4,        # CPU threads for processing
    flash_attn=True,  # Enable FlashAttention
    # Advanced
    mlock=False,      # Don't lock memory (Colab)
    numa=False,       # Single NUMA node
    silent=True
)
Parallel Request Handling¶
For server applications:
# Handle 4 concurrent requests
engine.load_model(
    model_path,
    gpu_layers=99,
    ctx_size=2048,
    batch_size=512,
    ubatch_size=128,
    n_parallel=4,  # 4 parallel sequences
    silent=True
)
Performance with n_parallel:
| n_parallel | Total tok/s | Latency/request | VRAM | Best For |
|---|---|---|---|---|
| 1 | 134 | 690 ms | 1.2 GB | Interactive chat |
| 2 | 250 | 755 ms | 1.5 GB | Small server |
| 4 | 460 | 830 ms | 2.3 GB | Production server |
| 8 | 790 | 980 ms | 3.9 GB | High throughput |
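n_parallel only helps if requests actually arrive concurrently. A sketch that drives the n_parallel=4 engine above with a thread pool; it assumes engine.infer is safe to call from multiple threads when n_parallel > 1, which you should verify for your llcuda version:
# Drive 4 parallel sequences with 4 client threads
# (assumes engine.infer may be called concurrently when n_parallel=4)
from concurrent.futures import ThreadPoolExecutor
prompts = ["Prompt A", "Prompt B", "Prompt C", "Prompt D"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda p: engine.infer(p, max_tokens=100), prompts))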
FlashAttention Optimization¶
When to Use FlashAttention¶
Enable it for contexts of 4096 tokens and above:
engine.load_model(
    model_path,
    ctx_size=8192,
    flash_attn=True,  # Enable FlashAttention
    silent=True
)
FlashAttention Benefits¶
| Context Size | Without FA | With FA | Speedup | VRAM Saved |
|---|---|---|---|---|
| 2048 | 134.3 | 135.2 | 1.01x | 0.12 GB |
| 4096 | 95.5 | 125.1 | 1.31x | 0.35 GB |
| 8192 | 55.2 | 105.3 | 1.91x | 0.98 GB |
| 16384 | 28.5 | 78.5 | 2.75x | 2.52 GB |
Key finding: FlashAttention gives significant speedups (1.3-2.8x) and substantial VRAM savings at long contexts.
Memory Optimization¶
Reduce VRAM Usage¶
1. Use Aggressive Quantization:
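# A sketch: step down from Q4_K_M to Q4_0 (or Q3_K_M) to save VRAM;
# the filename is a placeholder
engine.load_model(
    "model-Q4_0.gguf",
    silent=True
)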
2. Reduce Context Size:
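# A sketch: a smaller ctx_size shrinks the KV cache
engine.load_model(
    model_path,
    ctx_size=1024,
    silent=True
)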
3. Lower Batch Sizes:
# Smaller batches use less VRAM
engine.load_model(
    model_path,
    batch_size=256,
    ubatch_size=64,
    silent=True
)
Memory-Constrained Configuration¶
For GPUs with < 6 GB VRAM:
engine.load_model(
    "model-Q4_0.gguf",
    gpu_layers=99,
    ctx_size=1024,
    batch_size=256,
    ubatch_size=64,
    n_parallel=1,
    silent=True
)
Auto-Configuration¶
Let llcuda Optimize¶
from llcuda.utils import auto_configure_for_model
from pathlib import Path
# Auto-detect optimal settings
settings = auto_configure_for_model(Path("model.gguf"))
# Apply settings
engine.load_model(
    "model.gguf",
    **settings,  # Use all auto-configured values
    silent=True
)
Manual Override¶
# Start with auto-config
settings = auto_configure_for_model(Path("model.gguf"))
# Override specific settings
settings['ctx_size'] = 4096 # Use longer context
settings['n_parallel'] = 4 # Handle more requests
engine.load_model("model.gguf", **settings, silent=True)
Model Selection for Performance¶
Size vs Speed Trade-off¶
| Model Size | Tokens/sec (T4) | VRAM | Best For |
|---|---|---|---|
| 1B | 134 | 1.2 GB | ✅ Interactive, production |
| 3B | 48 | 2.0 GB | Balanced quality/speed |
| 7B | 21 | 5.0 GB | Quality-focused |
| 13B | 12 | 9.0 GB | Maximum quality |
Recommendation: For T4, use 1B models for best performance.
Popular High-Performance Models¶
# Fastest: Gemma 3-1B (134 tok/s)
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)
# Balanced: Llama 3.2-3B (48 tok/s)
engine.load_model(
    "unsloth/Llama-3.2-3B-Instruct-Q4_K_M-GGUF",
    silent=True
)
# Quality: Qwen 2.5-7B (21 tok/s)
engine.load_model(
    "Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    silent=True
)
Generation Parameter Tuning¶
Speed vs Quality¶
Fastest (Deterministic):
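# A sketch with assumed values: greedy decoding for maximum speed
result = engine.infer(
    prompt,
    max_tokens=100,
    temperature=0.0,  # deterministic
    top_k=1
)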
Balanced:
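# A sketch with assumed values: common middle-ground sampling defaults
result = engine.infer(
    prompt,
    max_tokens=100,
    temperature=0.7,
    top_k=40,
    top_p=0.9
)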
Creative (Slower):
result = engine.infer(
    prompt,
    max_tokens=100,
    temperature=1.5,
    top_k=200,
    top_p=0.95
)
# ~118 tok/s
Parameter Impact¶
| Parameter | Low Value | High Value | Speed Impact |
|---|---|---|---|
| temperature | Faster | Slower | 5-12% |
| top_k | Faster | Slower | 2-5% |
| top_p | Minimal | Minimal | < 1% |
Multi-GPU Configuration¶
Select Specific GPU¶
import os
# Use GPU 1 (set CUDA_VISIBLE_DEVICES before llcuda initializes CUDA)
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import llcuda
engine = llcuda.InferenceEngine()
Load Balance Across GPUs¶
# Use GPUs 0 and 1
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
# llama.cpp will distribute layers automatically
import llcuda
engine = llcuda.InferenceEngine()
engine.load_model(model_path, gpu_layers=99, silent=True)
Benchmarking Your Setup¶
Quick Benchmark¶
import llcuda
import time
# Setup
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf", silent=True)
# Warmup
for _ in range(3):
    engine.infer("Warmup", max_tokens=10)
# Benchmark
engine.reset_metrics()
start = time.time()
for _ in range(10):
    engine.infer("Test prompt", max_tokens=100)
elapsed = time.time() - start
metrics = engine.get_metrics()
print(f"Throughput: {metrics['throughput']['tokens_per_sec']:.1f} tok/s")
print(f"Latency: {metrics['latency']['mean_ms']:.0f}ms")
print(f"Total time: {elapsed:.2f}s")
Compare Configurations¶
configs = [
    {"ctx_size": 1024, "batch_size": 256},
    {"ctx_size": 2048, "batch_size": 512},
    {"ctx_size": 4096, "batch_size": 1024},
]
for config in configs:
    engine = llcuda.InferenceEngine()
    engine.load_model("model.gguf", **config, silent=True)
    # Warmup
    for _ in range(3):
        engine.infer("Warmup", max_tokens=10)
    # Benchmark
    engine.reset_metrics()
    for _ in range(10):
        engine.infer("Test", max_tokens=100)
    metrics = engine.get_metrics()
    print(f"Config {config}: {metrics['throughput']['tokens_per_sec']:.1f} tok/s")
    engine.unload_model()
Optimization Checklist¶
Pre-Deployment Checklist¶
- Use Q4_K_M quantization (best balance)
- Enable full GPU offload (gpu_layers=99)
- Set optimal context size (2048 for most cases)
- Configure batch sizes (512/128 for T4)
- Enable FlashAttention for long contexts
- Test with warmup runs
- Benchmark your specific use case
- Monitor VRAM usage
- Profile latency distribution (P50, P95, P99; see the sketch below)
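A sketch for the last item: time individual requests and report percentiles with the standard library (nothing llcuda-specific beyond engine.infer):
import time
import statistics
latencies_ms = []
for _ in range(50):  # 50 samples is a minimum; use more for a stable P99
    t0 = time.perf_counter()
    engine.infer("Test prompt", max_tokens=100)
    latencies_ms.append((time.perf_counter() - t0) * 1000)
# quantiles(n=100) returns 99 cut points: index 49 = P50, 94 = P95, 98 = P99
q = statistics.quantiles(latencies_ms, n=100)
print(f"P50: {q[49]:.0f} ms  P95: {q[94]:.0f} ms  P99: {q[98]:.0f} ms")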
Production Settings¶
import llcuda
from llcuda.utils import auto_configure_for_model
from pathlib import Path
# Step 1: Verify GPU
compat = llcuda.check_gpu_compatibility()
assert compat['compatible'], "GPU not compatible"
# Step 2: Auto-configure
model_path = Path("model.gguf")
settings = auto_configure_for_model(model_path)
# Step 3: Override for production
settings['n_parallel'] = 4 # Handle concurrent requests
settings['silent'] = True # Clean logs
# Step 4: Load
engine = llcuda.InferenceEngine()
engine.load_model(str(model_path), **settings)
# Step 5: Warmup
for _ in range(5):
    engine.infer("Warmup", max_tokens=10)
print("✅ Production deployment ready!")
Common Optimization Pitfalls¶
❌ Don't Do This¶
# TOO SMALL: Limits performance
engine.load_model(model_path, batch_size=64, ubatch_size=16)
# TOO LARGE: Wastes VRAM
engine.load_model(model_path, ctx_size=32768)
# PARTIAL OFFLOAD: When full offload possible
engine.load_model(model_path, gpu_layers=20) # Use 99 instead
# NO WARMUP: First inference is slow
result = engine.infer(prompt) # Add warmup first
✅ Do This Instead¶
# Optimal settings
engine.load_model(
    model_path,
    gpu_layers=99,
    ctx_size=2048,
    batch_size=512,
    ubatch_size=128,
    silent=True
)
# Warmup
for _ in range(3):
    engine.infer("Warmup", max_tokens=10)
# Production inference
result = engine.infer(prompt, max_tokens=200)
See Also¶
- Benchmarks - Performance data
- T4 Results - Detailed T4 analysis
- Model Selection - Choosing models
- Device API - GPU management
- Troubleshooting - Common issues