Unsloth Integration
Complete workflow: Fine-tune with Unsloth → Export GGUF → Deploy with llcuda.
Level: Intermediate | Time: 30 minutes | VRAM Required: 10-15 GB
Workflow Overview
Step 1: Fine-tune with Unsloth
from unsloth import FastLanguageModel

# Load the base model in 4-bit to keep VRAM usage low
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-it",
    max_seq_length=2048,
    dtype=None,          # None = auto-detect the best dtype for your GPU
    load_in_4bit=True,
)

# Add LoRA adapters so only a small fraction of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Fine-tune (add your training code here)
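For the training step itself, a minimal sketch using trl's SFTTrainer (the trainer Unsloth's own examples pair with) is shown below. Here `dataset` is a placeholder for your own Hugging Face dataset with a "text" column, the hyperparameters are illustrative rather than tuned, and the exact SFTTrainer signature varies across trl versions.

from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is assumed to be a Hugging Face Dataset with a "text" column.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,            # short demo run; raise for a real fine-tune
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()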
Step 2: Export to GGUF
# Merge the LoRA adapters into the base weights and save as GGUF
# with Q4_K_M quantization
model.save_pretrained_gguf(
    "output_model",
    tokenizer,
    quantization_method="q4_k_m",
)
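The exact filename of the exported GGUF can vary with the Unsloth version, so confirm what was actually written before wiring it into the deployment step:

from pathlib import Path

# List whatever GGUF file(s) the export actually produced
print(list(Path("output_model").glob("*.gguf")))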
Step 3: Deploy with llcuda
from llcuda.server import ServerManager, ServerConfig

config = ServerConfig(
    model_path="output_model/model-Q4_K_M.gguf",  # adjust to the actual exported filename
    n_gpu_layers=99,     # offload all layers to the GPU
    flash_attn=True,     # enable flash attention
)

server = ServerManager()
server.start_with_config(config)
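Once the server is running, a quick smoke test confirms the deployment end to end. The sketch below assumes llcuda exposes llama.cpp's OpenAI-compatible HTTP endpoint on localhost:8080 (the llama-server default); check your ServerConfig and the llcuda docs for the actual host and port.

import requests

# Assumption: the server serves llama.cpp's OpenAI-compatible API on port 8080
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])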