# NCCL Integration
NCCL vs tensor-split for distributed workloads.
## Overview

llcuda uses llama-server's native CUDA tensor-split, not NCCL, for multi-GPU inference. NCCL is only relevant when you move from inference to distributed training.
## Key Differences

llama-server (llcuda):

- Native CUDA layer distribution
- No NCCL required
- Used for LLM inference

PyTorch DDP:

- Uses NCCL for distributed training
- Used for fine-tuning
## llcuda tensor-split

```python
from llcuda.server import ServerConfig

config = ServerConfig(
    model_path="model.gguf",
    tensor_split="0.5,0.5",  # native CUDA split: ~50% of the model per GPU
    n_gpu_layers=99,         # offload all layers to the GPUs
)
```
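In llama-server, tensor-split values are treated as relative proportions rather than strict fractions, so "0.5,0.5" and "1,1" both split the model evenly across two GPUs. The layer distribution happens inside the CUDA backend itself; no NCCL collectives are involved.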
## PyTorch with NCCL
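Distributed training, by contrast, initializes NCCL explicitly through `torch.distributed`. The sketch below is a minimal, generic PyTorch DDP loop, not part of llcuda; the toy `Linear` model and the training details are placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model as a stand-in for whatever is being fine-tuned.
    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs via NCCL
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launch it with `torchrun --nproc_per_node=<num_gpus> train_ddp.py` (the filename is hypothetical); torchrun spawns one process per GPU and sets the environment variables the script reads.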
## When to Use Each

- llcuda tensor-split: multi-GPU inference
- PyTorch NCCL: multi-GPU training and fine-tuning
## Examples

See the NCCL Tutorial.