
Quick Start

Get started with llcuda v2.2.0 in 5 minutes on Kaggle dual T4 GPUs.

Level: Beginner | Time: 5 minutes | VRAM Required: 3-5 GB (fits on a single T4)


Overview

This tutorial covers the essentials:

  • Installing llcuda v2.2.0
  • Downloading a GGUF model
  • Starting the llama-server
  • Making your first chat completion
  • Cleaning up resources

Step 1: Install llcuda

pip install llcuda
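
To confirm the install, you can query the installed version with the standard library. importlib.metadata works for any pip-installed distribution, so this sketch makes no assumptions about llcuda's internals:

import importlib.metadata

# Report the version pip installed; should print 2.2.0 for this tutorial.
print(importlib.metadata.version("llcuda"))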

Step 2: Check GPUs

import torch

# Confirm CUDA is visible and enumerate the available devices.
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"GPU 0: {torch.cuda.get_device_name(0)}")
if torch.cuda.device_count() > 1:
    print(f"GPU 1: {torch.cuda.get_device_name(1)}")

Step 3: Download Model

from huggingface_hub import hf_hub_download

# Q4_K_M is a 4-bit quantization, small enough for a single T4.
model_path = hf_hub_download(
    repo_id="unsloth/gemma-2-2b-it-GGUF",
    filename="gemma-2-2b-it-Q4_K_M.gguf"
)
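
hf_hub_download returns the local cache path of the file; a quick size check confirms the download completed:

import os

# The model lives in the local Hugging Face cache after download.
print(model_path)
print(f"Size: {os.path.getsize(model_path) / 1024**3:.2f} GB")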

Step 4: Start Server

from llcuda.server import ServerManager, ServerConfig

config = ServerConfig(
    model_path=model_path,
    n_gpu_layers=99,   # 99 exceeds the model's layer count, so all layers run on GPU
    flash_attn=True    # enable flash attention
)

server = ServerManager()
server.start_with_config(config)
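
Loading the weights can take a little while. If you want an explicit readiness check before sending requests, llama-server exposes a /health endpoint that returns 200 once the model is loaded (a minimal polling sketch; port 8080 is assumed, matching Step 5):

import time
import requests

# Poll /health until llama-server reports it is ready to serve.
for _ in range(60):
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            print("Server ready")
            break
    except requests.ConnectionError:
        pass  # server process not accepting connections yet
    time.sleep(1)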

Step 5: Make Request

from llcuda.api.client import LlamaCppClient

# llama-server listens on port 8080 by default.
client = LlamaCppClient(base_url="http://localhost:8080")
response = client.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=200
)
print(response["choices"][0]["message"]["content"])
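
Because llama-server speaks an OpenAI-compatible REST API, the same request also works over plain HTTP without the bundled client (sketch using requests against the standard /v1/chat/completions route):

import requests

# Equivalent request sent straight to llama-server's OpenAI-style endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 200},
)
print(resp.json()["choices"][0]["message"]["content"])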

Step 6: Cleanup

server.stop()  # shuts down llama-server and frees its VRAM
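
In a notebook it is easy to skip this step when a cell errors out mid-run. A try/finally around the session guarantees the server process, and the VRAM it holds, is always released:

server = ServerManager()
try:
    server.start_with_config(config)
    # ... make requests here ...
finally:
    server.stop()  # runs even if a request above raised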

Expected Performance

  • Speed: ~60 tokens/sec (Gemma 2-2B Q4_K_M)
  • Latency: ~500 ms
  • VRAM: ~3-4 GB
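
You can verify the throughput figure yourself by timing a completion. The sketch below assumes the response includes an OpenAI-style usage field with a completion_tokens count:

import time

start = time.perf_counter()
response = client.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=200,
)
elapsed = time.perf_counter() - start

# "usage" follows the OpenAI response schema (assumption).
tokens = response["usage"]["completion_tokens"]
print(f"{tokens / elapsed:.1f} tokens/sec")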

Next Steps

Open in Kaggle
