Deployment Pipeline

Deploy fine-tuned Unsloth models with llcuda: export to GGUF, launch a GPU-backed server, and serve it through an OpenAI-compatible API.

Complete Pipeline

1. Fine-Tune (Unsloth)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(...)
# ... training ...
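
The block above elides the details. A fuller sketch of this step, where the base model name and LoRA hyperparameters are illustrative assumptions rather than values from this page:

from unsloth import FastLanguageModel

# Illustrative base model and context length; adjust for your task.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# ... run your training loop (e.g. TRL's SFTTrainer) here ...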

2. Export (Unsloth)

model.save_pretrained_gguf(
    "my_model",
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit medium k-quant; yields my_model-Q4_K_M.gguf (used in step 3)
)
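
Before moving on, it is worth confirming what the export produced. A minimal sketch, assuming the output filename shown in step 3 (the exact name Unsloth writes can vary by version):

import os

# The on-disk GGUF size is a rough lower bound on the VRAM needed for the
# weights alone; KV cache and activations come on top of this.
gguf_path = "my_model-Q4_K_M.gguf"
size_gb = os.path.getsize(gguf_path) / 1024**3
print(f"{gguf_path}: {size_gb:.1f} GB on disk")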

3. Deploy (llcuda)

from llcuda.server import ServerManager, ServerConfig

config = ServerConfig(
    model_path="my_model-Q4_K_M.gguf",
    n_gpu_layers=99,             # offload all layers to the GPU
    tensor_split="0.5,0.5",      # split weights evenly across two GPUs (e.g. dual T4s)
    flash_attn=True,             # enable FlashAttention kernels
)

server = ServerManager()
server.start_with_config(config)
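
Before sending traffic, wait for the server to come up. A sketch, assuming llcuda launches llama.cpp's llama-server, which exposes a /health route on its default port 8080 (adjust the URL if your config differs):

import time
import requests

# Poll until the server reports healthy, or give up after 60 seconds.
for _ in range(60):
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            print("Server ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(1)
else:
    raise RuntimeError("Server did not become healthy within 60 seconds")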

4. Serve (OpenAI API)

from llcuda.api import LlamaCppClient

client = LlamaCppClient()
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)  # assuming OpenAI-style response objects
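
Because the endpoint speaks the OpenAI API, the stock openai client can be pointed at it as well. The base URL, port, dummy API key, and model name below are assumptions based on llama-server defaults, not documented llcuda behavior:

from openai import OpenAI

# llama-server typically ignores the API key, but the SDK requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my_model",  # single-model servers generally accept any name here
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)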

Production Checklist

  • Model exported to GGUF
  • VRAM requirements verified
  • FlashAttention enabled
  • Server health checked
  • API tested end to end (a smoke-test sketch follows the list)
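
A minimal smoke test covering the last three items, assuming the server listens on localhost:8080 and exposes llama-server's /health and /v1/chat/completions routes (both assumptions):

import os
import requests

BASE = "http://localhost:8080"

# 1. GGUF export exists on disk.
assert os.path.exists("my_model-Q4_K_M.gguf"), "GGUF export missing"

# 2. Server reports healthy.
assert requests.get(f"{BASE}/health", timeout=5).status_code == 200, "health check failed"

# 3. A round-trip chat completion succeeds.
reply = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "my_model",  # assumed; single-model servers accept any name
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)
assert reply.status_code == 200, "API test failed"
print("All checks passed")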