Server Setup¶
A deep dive into llama-server configuration and lifecycle management.
Level: Beginner | Time: 15 minutes | VRAM Required: 5-8 GB (single T4)
ServerConfig Parameters¶
from llcuda.server import ServerConfig

config = ServerConfig(
    model_path="model.gguf",  # Path to the GGUF model file
    n_gpu_layers=99,          # Offload all layers to the GPU (99 exceeds any model's layer count)
    context_size=4096,        # Context window, in tokens
    n_batch=2048,             # Batch size for prompt processing
    flash_attn=True,          # Enable FlashAttention
    tensor_split=None,        # Single GPU (see the multi-GPU section below)
    host="127.0.0.1",
    port=8080,
)
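Under the hood, these fields correspond to llama-server command-line flags. The sketch below shows one plausible mapping using flag names from upstream llama.cpp (--n-gpu-layers, --ctx-size, --batch-size, --flash-attn, --tensor-split); llcuda's actual translation may differ, and to_cli_args is a hypothetical helper, not part of the library.

def to_cli_args(config: ServerConfig) -> list[str]:
    # Hypothetical helper: build a llama-server argument list from a config.
    args = [
        "--model", config.model_path,
        "--n-gpu-layers", str(config.n_gpu_layers),
        "--ctx-size", str(config.context_size),
        "--batch-size", str(config.n_batch),
        "--host", config.host,
        "--port", str(config.port),
    ]
    if config.flash_attn:
        args.append("--flash-attn")  # boolean flag in upstream llama-server
    if config.tensor_split:
        args += ["--tensor-split", config.tensor_split]
    return args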
Server Lifecycle¶
from llcuda.server import ServerManager

server = ServerManager()

# Start the server with the configuration above
server.start_with_config(config)

# Check status
print(f"Running: {server.is_running()}")
print(f"URL: {server.get_base_url()}")

# Block for up to 30 seconds until the server is ready to accept requests
server.wait_until_ready(timeout=30)

# Retrieve the server logs
logs = server.get_logs()

# Shut the server down
server.stop()
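Once wait_until_ready returns, the process behind the manager is a standard llama-server instance, which exposes OpenAI-compatible HTTP endpoints. A minimal request sketch, assuming get_base_url() returns something like http://127.0.0.1:8080:

import requests

# Send a chat completion request to llama-server's OpenAI-compatible endpoint
resp = requests.post(
    f"{server.get_base_url()}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

In scripts, placing server.stop() in a finally block ensures the process is cleaned up even if a request raises.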
Multi-GPU Configuration¶
config = ServerConfig(
    model_path="model.gguf",
    tensor_split="0.5,0.5",  # 50/50 split across two GPUs
    split_mode="layer",      # Assign whole layers to each GPU ("row" splits within layers)
    n_gpu_layers=99,
    flash_attn=True,
)
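This configuration requires at least two GPUs; the single-T4 VRAM estimate above applies to the earlier examples. In llama.cpp, tensor-split values are relative proportions that get normalized, so they need not sum to 1 (for example, "0.7,0.3" favors the first card). A quick sketch of an even split across n GPUs, with even_split as a hypothetical helper:

def even_split(n_gpus: int) -> str:
    # Hypothetical helper: equal tensor_split proportions for n GPUs.
    return ",".join(f"{1.0 / n_gpus:.2f}" for _ in range(n_gpus))

config = ServerConfig(
    model_path="model.gguf",
    tensor_split=even_split(2),  # "0.50,0.50"
    split_mode="layer",
    n_gpu_layers=99,
    flash_attn=True,
)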