# Running 70B Models
Run 70B-parameter models on Kaggle's dual T4 GPUs (~30 GB combined VRAM).
## Requirements
- Quantization: IQ3_XS (3.3 bpw)
- VRAM: ~25-27 GB (see the estimate below)
- Strategy: Dual T4 tensor-split
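The VRAM figure follows directly from the bits-per-weight arithmetic. Here is a quick back-of-the-envelope check in plain Python; the KV-cache term assumes a typical Llama-style 70B shape with grouped-query attention, so verify the constants against your specific model:

```python
# Weights: 70B parameters at 3.3 bits per weight (IQ3_XS).
params = 70e9
bpw = 3.3
weights_gib = params * bpw / 8 / 2**30     # bits -> bytes -> GiB
print(f"Weights:  {weights_gib:.1f} GiB")  # ~26.9 GiB

# Rough fp16 KV cache at 2048-token context, assuming 80 layers,
# 8 KV heads, head dim 128 (typical Llama-70B GQA shape).
kv_gib = 2 * 80 * 8 * 128 * 2 * 2048 / 2**30
print(f"KV cache: {kv_gib:.2f} GiB")       # ~0.63 GiB
```

This naive estimate ignores per-block quantization scales and the tensors that quantized formats keep at higher precision, so treat it as a ballpark roughly in line with the ~25-27 GB figure above, with a few gigabytes left for compute buffers.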
## Configuration
```python
from llcuda.server import ServerConfig

config = ServerConfig(
    model_path="llama-70b-IQ3_XS.gguf",
    n_gpu_layers=99,            # offload all layers to the GPUs
    tensor_split="0.48,0.48",   # split across both T4s, leaving headroom
    context_size=2048,          # smaller context keeps the KV cache small
    batch_size=128,             # smaller batch lowers peak VRAM
    flash_attn=True,            # flash attention reduces memory overhead
)
```
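Before loading the model, confirm that both T4s are visible and mostly free; a stale notebook session holding a few gigabytes can push a 70B load over the limit. A minimal check via nvidia-smi, which is available on Kaggle GPU instances:

```python
import subprocess

# List each GPU's index, name, total and free memory.
# Expect two T4s with roughly 15 GiB each.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.total,memory.free",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```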
## Performance
- Speed: ~8-12 tokens/sec (verify with the timing sketch below)
- Quality: good for a 3.3 bpw quant; expect some degradation versus 4-bit and larger quants
- VRAM: ~27 GB used
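The throughput number is easy to check in your own session. The sketch below times a single completion request; it assumes the server exposes an OpenAI-compatible /v1/completions endpoint on 127.0.0.1:8080 (common for llama.cpp-based servers, but an assumption here; adjust the URL, port, and payload to match your llcuda setup):

```python
import time
import requests

t0 = time.time()
resp = requests.post(
    "http://127.0.0.1:8080/v1/completions",  # hypothetical endpoint/port
    json={"prompt": "Explain quantization in one paragraph.",
          "max_tokens": 128},
    timeout=300,
)
elapsed = time.time() - t0

# completion_tokens is part of the OpenAI-style usage object;
# fall back to max_tokens if your server omits it.
tokens = resp.json().get("usage", {}).get("completion_tokens", 128)
print(f"{tokens / elapsed:.1f} tokens/sec")
```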