Tensor Split Configuration

How to configure tensor split for inference across two T4 GPUs.

What is Tensor Split?

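Tensor split is llama.cpp's mechanism for distributing a model across multiple GPUs, so that a model too large for one card can be served from two. Each value in the split sets the proportion of the model placed on the corresponding GPU.

It does not use NCCL: llama.cpp handles the multi-GPU work through its own CUDA backend.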
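
To make the proportions concrete, here is a minimal sketch of how a split string could be turned into a per-layer GPU assignment. It illustrates the general idea (normalize the values, then fill GPUs in order by cumulative share) and is not a copy of llama.cpp's code. The function names `parse_tensor_split` and `assign_layers` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def parse_tensor_split(spec: str) -> list[float]:
    """Parse a string such as "0.5,0.5" into normalized per-GPU fractions."""
    weights = [float(part) for part in spec.split(",")]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError(f"invalid tensor split: {spec!r}")
    total = sum(weights)
    return [w / total for w in weights]


def assign_layers(n_layers: int, fractions: list[float]) -> list[int]:
    """Return the GPU index for each layer, filling GPUs in order."""
    cumulative = list(accumulate(fractions))
    last_gpu = len(fractions) - 1
    return [
        # The first GPU whose cumulative share exceeds the layer's position.
        min(bisect_right(cumulative, layer / n_layers), last_gpu)
        for layer in range(n_layers)
    ]


print(assign_layers(4, parse_tensor_split("0.5,0.5")))  # [0, 0, 1, 1]
```

The values are relative weights, so under this sketch "3,1" and "0.75,0.25" describe the same split.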

Configuration

config = ServerConfig(
    tensor_split="0.5,0.5",  # 50% GPU 0, 50% GPU 1
    split_mode="layer",       # Split by layers
)
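
For reference, llama.cpp's `llama-server` exposes the same two settings as the command-line flags `--tensor-split` and `--split-mode`. The sketch below builds an equivalent argument list. It assumes, without confirmation from this page, that `ServerConfig` forwards its fields to those flags. The helper name `llama_server_args` and the model path are hypothetical.

```python
def llama_server_args(
    model_path: str,
    tensor_split: str = "0.5,0.5",
    split_mode: str = "layer",
    n_gpu_layers: int = 99,
) -> list[str]:
    """Build a llama-server argument list for a multi-GPU launch."""
    # llama.cpp accepts none, layer, and row for --split-mode.
    if split_mode not in {"none", "layer", "row"}:
        raise ValueError(f"unknown split mode: {split_mode!r}")
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # offload layers to the GPUs
        "--split-mode", split_mode,
        "--tensor-split", tensor_split,
    ]


# Hypothetical model path, for illustration only.
print(" ".join(llama_server_args("/models/example.gguf")))
```

The list can be passed to `subprocess.run` directly, which avoids shell quoting issues.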

Split Modes

layer: Assign whole layers to each GPU (recommended)
row: Split individual tensors by rows across GPUs (requires special support)

When to Use

Models larger than about 15 GB, which will not fit on a single T4 (16 GB of VRAM)
32B and larger models quantized to Q4_K_M
70B models quantized to IQ3_XS
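
The 15 GB rule of thumb above can be expressed as a quick check. This is a rough sketch: it compares the model file size against an assumed usable budget per GPU and ignores the KV cache and compute buffers, which need additional memory. The function name `plan_split` and the example sizes are hypothetical.

```python
def plan_split(
    model_gb: float,
    usable_gb_per_gpu: float = 15.0,  # assumed usable share of a T4's 16 GB
    n_gpus: int = 2,
) -> str:
    """Classify a model as fitting one GPU, needing a split, or too large."""
    if model_gb <= usable_gb_per_gpu:
        return "single"
    if model_gb <= usable_gb_per_gpu * n_gpus:
        return "split"
    return "too_large"


# Illustrative sizes in GB, not measurements of specific models.
for size in (8.0, 19.0, 40.0):
    print(size, plan_split(size))
```

A model that lands close to the combined budget may still fail to load once context memory is allocated, so leave headroom.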