
GPU 0 - LLM Inference

Configure GPU 0 for the llama.cpp server.

Setup

from llcuda.server import ServerManager, ServerConfig

config = ServerConfig(
    model_path="model.gguf",
    n_gpu_layers=99,        # All layers on GPU 0
    flash_attn=True,
)

# llama-server uses GPU 0 by default
server = ServerManager()
server.start_with_config(config)
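
On a multi-GPU machine you can make the default selection explicit by restricting the process to GPU 0 with the standard CUDA_VISIBLE_DEVICES environment variable. This is a minimal sketch, not part of the llcuda API; the variable must be set before the server process is launched so the child process inherits it.

import os

# Restrict the CUDA runtime to GPU 0 so the server cannot spill onto other GPUs.
# Set this before calling server.start_with_config(config).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"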

VRAM Usage

Model   Quant    VRAM on GPU 0
1-3B    Q4_K_M   2-4 GB
7B      Q4_K_M   5-6 GB
13B     Q4_K_M   8-9 GB
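
To check what a given model actually occupies on GPU 0, query the driver after the server has loaded it. A minimal sketch using nvidia-smi (assumed to be on the PATH); this is not an llcuda helper.

import subprocess

def gpu0_memory_used_mib() -> int:
    """Memory currently allocated on GPU 0, in MiB, as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--id=0",
         "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

print(f"GPU 0 memory in use: {gpu0_memory_used_mib()} MiB")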

Performance

  • FlashAttention: 2-3x speedup (a quick throughput check is sketched after this list)
  • Tensor Cores: FP16/TF32 acceleration
  • Context: Up to 8192 tokens
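
The sketch below times a single completion request to estimate tokens per second on GPU 0. It assumes the server started above listens on llama-server's default address, http://127.0.0.1:8080, and exposes the standard /completion endpoint; adjust the URL and prompt as needed.

import time
import requests

N_PREDICT = 128  # tokens to generate for the measurement

start = time.perf_counter()
resp = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed default llama-server address
    json={
        "prompt": "Explain FlashAttention in one paragraph.",
        "n_predict": N_PREDICT,
    },
    timeout=300,
)
resp.raise_for_status()
elapsed = time.perf_counter() - start

# Rough wall-clock estimate, including prompt processing time.
print(f"~{N_PREDICT / elapsed:.1f} tokens/s on GPU 0")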