llcuda v2.2.0: CUDA12 Inference Backend for Unsloth¶
CUDA12-first inference backend for Unsloth with Graphistry network visualization on Kaggle dual Tesla T4 GPUs. Fine-tune with Unsloth → Export to GGUF → Deploy on Kaggle → Visualize with Graphistry.
What is llcuda v2.2.0?¶
llcuda is a CUDA 12 inference backend specifically designed for deploying Unsloth-fine-tuned models on Kaggle's dual Tesla T4 GPUs (30GB total VRAM). It provides:
:material-gpu: Dual T4 Architecture¶
Run on Kaggle's 2× Tesla T4 GPUs (15GB each)
- Native CUDA tensor-split for multi-GPU
- Support for 70B models with IQ3_XS quantization
- FlashAttention for 2-3x faster inference
- 961MB pre-built CUDA 12.5 binaries
Split-GPU Design¶
Unique architecture: LLM on GPU 0 + Graphistry on GPU 1
- GPU 0: llama.cpp server for LLM inference
- GPU 1: RAPIDS cuGraph + Graphistry visualization
- Extract knowledge graphs from LLM outputs
- Visualize millions of nodes and edges
Unsloth Integration¶
Seamless workflow from training to deployment
- Fine-tune with Unsloth (2x faster training)
- Export to GGUF format with save_pretrained_gguf()
- Deploy with llcuda on Kaggle
- Complete end-to-end pipeline
Production Ready¶
Built for Kaggle production workloads
- OpenAI-compatible API via llama-server (see the example after this list)
- 29 quantization formats (K-quants, I-quants)
- NCCL support for PyTorch distributed
- Auto-download binaries from GitHub Releases
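Because the server speaks the OpenAI wire protocol, any OpenAI SDK can point at it. A minimal sketch with the official `openai` Python package, assuming a server already listening on localhost:8080; the model string is a placeholder, since llama-server serves whichever GGUF it loaded:

```python
# Query a running llcuda/llama-server instance with the stock OpenAI SDK.
# base_url and the dummy api_key assume a local, unauthenticated server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="local-gguf",  # placeholder; llama-server accepts any model string
    messages=[{"role": "user", "content": "Hello from Kaggle!"}],
    max_tokens=64,
)
print(reply.choices[0].message.content)
```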
Core Architecture¶
llcuda v2.2.0 implements a unique split-GPU architecture for Kaggle's dual T4 environment:
%%{init: {'theme':'base', 'themeVariables': {'fontSize':'18px'}}}%%
graph TD
A[GGUF Model<br/>HuggingFace] --> B[llcuda Deployment<br/>Kaggle Dual T4]
B --> C[GPU 0: llama-server<br/>LLM Inference]
B --> D[GPU 1: RAPIDS + Graphistry<br/>Analytics & Visualization]
C --> E[OpenAI-Compatible API<br/>:8080]
E --> F[Knowledge Extraction<br/>Entity & Relationships]
F --> D
D --> G[Interactive Dashboards<br/>Graphistry Cloud]
style A fill:#FF9800,stroke:#F57C00,stroke-width:3px,color:#fff
style B fill:#FF5722,stroke:#E64A19,stroke-width:3px,color:#fff
style C fill:#4CAF50,stroke:#388E3C,stroke-width:3px,color:#fff
style D fill:#2196F3,stroke:#1976D2,stroke-width:3px,color:#fff
style E fill:#00BCD4,stroke:#0097A7,stroke-width:3px,color:#fff
style F fill:#00BCD4,stroke:#0097A7,stroke-width:3px,color:#fff
style G fill:#03A9F4,stroke:#0288D1,stroke-width:3px,color:#fff
classDef default font-size:16px,padding:15px
Split-GPU Configuration¶
┌──────────────────────────────────────────────────────────────────────┐
│                KAGGLE DUAL T4 SPLIT-GPU ARCHITECTURE                 │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   GPU 0: Tesla T4 (15GB)                GPU 1: Tesla T4 (15GB)      │
│   ┌────────────────────────┐            ┌────────────────────────┐  │
│   │  llama-server          │            │  RAPIDS cuDF           │  │
│   │  GGUF Model            │ ─────────> │  cuGraph               │  │
│   │  LLM Inference         │  extract   │  Graphistry[ai]        │  │
│   │  ~5-12GB VRAM          │  graphs    │  Network Viz           │  │
│   └────────────────────────┘            └────────────────────────┘  │
│                                                                      │
│   • tensor-split for multi-GPU          • Millions of nodes/edges   │
│   • FlashAttention enabled              • GPU-accelerated rendering │
│   • OpenAI API compatible               • Interactive exploration   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Quick Start (5 Minutes)¶
Get llcuda v2.2.0 running on Kaggle in just 5 minutes!
Step 1: Install llcuda¶
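Assuming the package keeps the same name on PyPI, installation in a Kaggle GPU notebook is a single `pip install llcuda`. The pip package itself is tiny (~62 KB); the 961 MB CUDA 12.5 binaries are fetched separately on first import (see the note below).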
Step 2: Verify Dual T4 Setup¶
import llcuda
from llcuda.api.multigpu import detect_gpus, print_gpu_info
# Check GPU configuration
gpus = detect_gpus()
print(f"✓ Detected {len(gpus)} GPUs")
print_gpu_info()
# Expected output:
# ✓ Detected 2 GPUs
# GPU 0: Tesla T4 (15.0 GB)
# GPU 1: Tesla T4 (15.0 GB)
Step 3: Run Basic Inference¶
from llcuda.server import ServerManager, ServerConfig
# Configure for single GPU (GPU 0)
config = ServerConfig(
model_path="model.gguf", # Your GGUF model
n_gpu_layers=99, # Offload all layers to GPU
flash_attn=True, # Enable FlashAttention
)
# Start server
server = ServerManager()
server.start_with_config(config)
server.wait_until_ready()
# OpenAI-compatible client
from llcuda.api import LlamaCppClient
client = LlamaCppClient("http://localhost:8080")
response = client.chat.completions.create(
messages=[{"role": "user", "content": "Explain quantum computing"}],
max_tokens=200
)
print(response.choices[0].message.content)
Auto-Download Binaries
CUDA binaries (961 MB) download automatically from GitHub Releases v2.2.0 on first import. Cached for future runs!
Full Installation Guide · Kaggle Setup Tutorial
Key Features of v2.2.0¶
1. Multi-GPU Inference on Kaggle¶
Run models up to 70B parameters using both T4 GPUs with native CUDA tensor-split:
from llcuda.api.multigpu import kaggle_t4_dual_config
# Optimized dual T4 configuration
config = kaggle_t4_dual_config(model_size_gb=25) # For 70B IQ3_XS
print(config.to_cli_args())
# Output: ['-ngl', '-1', '--split-mode', 'layer', '--tensor-split', '0.5,0.5', '-fa']
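The split need not be even. If you plan to keep GPU 1 free for Graphistry (see the split-GPU section below), you can pin all layers to GPU 0. A sketch reusing the `ServerConfig` fields shown on this page; the `"1.0,0.0"` value follows llama.cpp's `--tensor-split` semantics, and that `ServerConfig` forwards it unchanged is an assumption:

```python
# Sketch: keep the whole model on GPU 0 so GPU 1 stays free for
# RAPIDS/Graphistry. "1.0,0.0" follows llama.cpp --tensor-split semantics;
# ServerConfig forwarding this string unchanged is an assumption.
from llcuda.server import ServerConfig

config = ServerConfig(
    model_path="model.gguf",
    n_gpu_layers=99,
    tensor_split="1.0,0.0",  # all layers on GPU 0, none on GPU 1
    flash_attn=True,
)
```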
Supported Model Sizes on Dual T4 (30GB VRAM):
| Model Size | Quantization | VRAM Required | Runs On |
|---|---|---|---|
| 1-3B | Q4_K_M | 2-3 GB | ✅ Single T4 |
| 7-8B | Q4_K_M | 5-6 GB | ✅ Single T4 |
| 13B | Q4_K_M | 8-9 GB | ✅ Single T4 |
| 32-34B | Q4_K_M | 20-22 GB | ✅ Dual T4 |
| 70B | IQ3_XS | 25-27 GB | ✅ Dual T4 |
:material-gpu: Multi-GPU Guide
2. Unsloth Fine-Tuning Pipeline¶
Complete workflow from fine-tuning to deployment:
from unsloth import FastLanguageModel
# Load model for training
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-1.5B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj"],
lora_alpha=16,
lora_dropout=0,
)
# Train your model (trainer setup elided; Unsloth examples typically use TRL's SFTTrainer)...
trainer.train()
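Between training and deployment sits the GGUF export mentioned above. Continuing from the training snippet, a minimal sketch using Unsloth's `save_pretrained_gguf()` (directory name and quantization method are illustrative):

```python
# Export the fine-tuned model straight to quantized GGUF via Unsloth.
# The directory and quantization method are illustrative choices.
model.save_pretrained_gguf(
    "my_finetuned_model",
    tokenizer,
    quantization_method="q4_k_m",  # yields the Q4_K_M file deployed below
)
```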
from llcuda.server import ServerManager, ServerConfig
# Deploy on Kaggle dual T4
config = ServerConfig(
model_path="my_finetuned_model-Q4_K_M.gguf",
n_gpu_layers=99,
tensor_split="0.5,0.5", # Use both GPUs
flash_attn=True,
)
server = ServerManager()
server.start_with_config(config)
# Now serving at http://localhost:8080 with OpenAI API
3. Split-GPU Architecture with Graphistry¶
Unique capability: run LLM inference on GPU 0 while using GPU 1 for RAPIDS/Graphistry visualization.
from llcuda.graphistry import SplitGPUManager, GraphWorkload, register_graphistry
# Configure split-GPU setup
manager = SplitGPUManager()
manager.assign_llm(0)
manager.assign_graph(1)
# Graphistry on GPU 1
workload = GraphWorkload(gpu_id=1)
register_graphistry(api=3, protocol="https", server="hub.graphistry.com")
# Run LLM on GPU 0 and visualize graphs on GPU 1
Use Cases:

- Extract knowledge graphs from LLM outputs → visualize with Graphistry (sketched below)
- Analyze entity relationships in generated text
- Interactive exploration of LLM-generated networks
- Real-time graph updates from streaming LLM responses
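A minimal sketch of the extract-then-visualize loop using the public PyGraphistry API (`register`/`edges`/`bind`/`plot`); the triples below stand in for whatever your extraction prompt actually returns and are not produced by any llcuda API:

```python
# Visualize LLM-extracted subject-relation-object triples with PyGraphistry.
import pandas as pd
import graphistry

graphistry.register(api=3, protocol="https", server="hub.graphistry.com",
                    username="YOUR_USER", password="YOUR_PASS")

# Stand-in for triples parsed from an LLM response generated on GPU 0
triples = [
    ("Alan Turing", "worked_at", "Bletchley Park"),
    ("Alan Turing", "proposed", "Turing machine"),
    ("Bletchley Park", "located_in", "England"),
]
edges = pd.DataFrame(triples, columns=["src", "rel", "dst"])

# Bind endpoints and render an interactive network on Graphistry Hub
g = graphistry.edges(edges, "src", "dst").bind(edge_title="rel")
g.plot()
```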
4. 29 GGUF Quantization Formats¶
llcuda supports all llama.cpp quantization types (a rough VRAM-sizing sketch follows the lists below):
K-quants - best quality-to-size ratio with double quantization:
- Q4_K_M - 4.8 bpw, best for most models (recommended)
- Q5_K_M - 5.7 bpw, higher quality
- Q6_K - 6.6 bpw, near FP16 quality
- Q8_0 - 8.5 bpw, very high quality
I-quants - importance-matrix quantization for 70B models:
- IQ3_XS - 3.3 bpw, fits 70B on dual T4
- IQ4_XS - 4.3 bpw, better quality
- IQ2_XS - 2.3 bpw, extreme compression
- IQ1_S - 1.6 bpw, smallest possible
Legacy quants - standard quantization types:
- Q4_0 - 4.5 bpw, legacy format
- Q5_0 - 5.5 bpw, legacy format
- Q8_0 - 8.5 bpw, high precision
Unquantized formats:
- F32 - 32-bit float
- F16 - 16-bit float
- BF16 - Brain float 16
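As a rough sizing rule tying the bpw figures above to the VRAM table earlier: the weight footprint is parameters × bits-per-weight ÷ 8, and KV cache plus CUDA buffers come on top. A sketch in plain Python (no llcuda APIs; these are coarse upper bounds, and mixed-precision I-quants can land a little lower, as in the 25-27 GB quoted for 70B IQ3_XS):

```python
# Coarse weight footprint in GB: params (billions) * bits-per-weight / 8.
# Runtime VRAM adds KV cache, activations, and CUDA buffers on top.
def weights_gb(params_b: float, bpw: float) -> float:
    return params_b * bpw / 8

for params_b, name, bpw in [(7, "Q4_K_M", 4.8), (70, "IQ3_XS", 3.3), (70, "Q4_K_M", 4.8)]:
    verdict = "fits" if weights_gb(params_b, bpw) < 30 else "does not fit"
    print(f"{params_b}B @ {name}: ~{weights_gb(params_b, bpw):.1f} GB weights ({verdict} in 30 GB)")
# 7B @ Q4_K_M: ~4.2 GB weights (fits in 30 GB)
# 70B @ IQ3_XS: ~28.9 GB weights (fits in 30 GB)
# 70B @ Q4_K_M: ~42.0 GB weights (does not fit in 30 GB)
```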
Performance Benchmarks¶
Real Kaggle dual T4 performance metrics:
| Model | Quantization | GPUs | Tokens/sec | Latency | VRAM Usage |
|---|---|---|---|---|---|
| Gemma 2-2B | Q4_K_M | 2× T4 | ~60 tok/s | - | 4 GB |
| Qwen2.5-7B | Q4_K_M | 2× T4 | ~35 tok/s | - | 10 GB |
| Llama-3.1-70B | IQ3_XS | 2× T4 | ~8-12 tok/s | - | 27 GB |
| Gemma 3-1B | Q4_K_M | 1× T4 | ~45 tok/s | 690ms | 3 GB |
Performance Optimization
- Enable FlashAttention for 2-3x speedup
- Use tensor-split for models >15GB
- K-quants provide best quality/speed balance
- I-quants enable 70B models on 30GB VRAM
Tutorial Notebooks (13 Kaggle Notebooks)¶
Complete tutorial series for Kaggle dual T4 environment - from beginner to advanced visualization.
Core Tutorials (1-10)¶
| # | Notebook | Description | Time |
|---|---|---|---|
| 01 | Quick Start | 5-minute introduction | 5 min |
| 02 | Server Setup | Server configuration & lifecycle | 15 min |
| 03 | Multi-GPU | Dual T4 tensor-split | 20 min |
| 04 | GGUF Quantization | K-quants, I-quants, parsing | 20 min |
| 05 | Unsloth Integration | Fine-tune → GGUF → Deploy | 30 min |
| 06 | Split-GPU + Graphistry | LLM + RAPIDS visualization | 30 min |
| 07 | Knowledge Graph Extraction | LLM-driven entity & relation graphs | 30 min |
| 08 | Document Network Analysis | GPU graph analytics for documents | 35 min |
| 09 | Large Models (13B+) | Large models on dual T4 | 30 min |
| 10 | Complete Workflow | End-to-end production | 45 min |
Visualization Trilogy (11-13)¶
| # | Notebook | Description | Time |
|---|---|---|---|
| 11 | GGUF Neural Network Visualization | Complete architecture → dashboards | 60 min |
| 12 | GGUF Attention Mechanism Explorer | Q-K-V attention analysis | 20 min |
| 13 | GGUF Token Embedding Visualizer | 3D embedding space exploration | 15 min |
View All Tutorials · Visualization Trilogy
Learning Paths¶
Choose your path based on experience level:
Beginner path: perfect for first-time users. Learn the basics of llcuda on Kaggle.
What's New in v2.2.0¶
Major Release Highlights
Positioned as Unsloth Inference Backend
- llcuda is now the official CUDA12 inference backend for Unsloth
- Seamless workflow: Unsloth (training) → llcuda (inference)
- Complete Kaggle dual T4 build notebook included
New Features:
- Kaggle Dual T4 Build - Complete build notebook for reproducible binaries
- Split-GPU Architecture - LLM on GPU 0 + Graphistry on GPU 1
- Multi-GPU Clarification - Native CUDA tensor-split (NOT NCCL)
- 961MB Binary Package - Pre-built CUDA 12.5 binaries for T4
- Graphistry Integration - PyGraphistry for knowledge graph visualization
- 70B Model Support - IQ3_XS quantization for large models
- FlashAttention All Quants - Enabled for all quantization types
Performance:
| Platform | GPU | Model | Tokens/sec |
|---|---|---|---|
| Kaggle | 2× T4 | Gemma 2-2B Q4_K_M | ~60 tok/s |
| Kaggle | 2× T4 | Llama 70B IQ3_XS | ~12 tok/s |
Technical Architecture¶
llcuda v2.2.0 is built on proven technologies:
- llama.cpp Server
    - Build 7760 (commit 388ce82)
    - OpenAI-compatible API
    - Native CUDA tensor-split
    - FlashAttention support
- CUDA 12.5
    - SM 7.5 (Turing) targeting
    - cuBLAS acceleration
    - Static linking
    - 961MB binary package
- Python 3.11+
    - Type-safe APIs
    - Async/await support
    - Modern packaging
    - 62KB pip package
- RAPIDS + Graphistry
    - cuDF for GPU DataFrames
    - cuGraph for network analysis
    - PyGraphistry visualization
    - Millions of nodes/edges
API Reference¶
llcuda provides comprehensive Python APIs:
| Module | Description |
|---|---|
| llcuda.api.client | OpenAI-compatible llama.cpp client |
| llcuda.api.multigpu | Multi-GPU configuration for Kaggle |
| llcuda.api.gguf | GGUF parsing and quantization tools |
| llcuda.api.nccl | NCCL for distributed PyTorch |
| llcuda.server | Server lifecycle management |
| llcuda.graphistry | Graphistry integration helpers |
Community & Support¶
- GitHub Repository: github.com/llcuda/llcuda
- GitHub Releases: v2.2.0 Download
- Bug Reports: GitHub Issues
- Email: waqasm86@gmail.com
License¶
MIT License - Free for commercial and personal use. See LICENSE.