llcuda v2.2.0: CUDA12 Inference Backend for Unsloth¶
CUDA12-first inference backend for Unsloth with Graphistry network visualization on Kaggle dual Tesla T4 GPUs. Fine-tune with Unsloth → Export to GGUF → Deploy on Kaggle → Visualize with Graphistry.
What is llcuda v2.2.0?¶
llcuda is a CUDA 12 inference backend specifically designed for deploying Unsloth-fine-tuned models on Kaggle's dual Tesla T4 GPUs (30GB total VRAM). It provides:
:material-gpu: Dual T4 Architecture¶
Run on Kaggle's 2× Tesla T4 GPUs (15GB each)
- Native CUDA tensor-split for multi-GPU
- Support for 70B models with IQ3_XS quantization
- FlashAttention for 2-3x faster inference
- 961MB pre-built CUDA 12.5 binaries
Split-GPU Design¶
Unique architecture: LLM on GPU 0 + Graphistry on GPU 1
- GPU 0: llama.cpp server for LLM inference
- GPU 1: RAPIDS cuGraph + Graphistry visualization
- Extract knowledge graphs from LLM outputs
- Visualize millions of nodes and edges
Unsloth Integration¶
Seamless workflow from training to deployment
- Fine-tune with Unsloth (2x faster training)
- Export to GGUF format with save_pretrained_gguf()
- Deploy with llcuda on Kaggle
- Complete end-to-end pipeline
Production Ready¶
Built for Kaggle production workloads
- OpenAI-compatible API via llama-server (see the example after this list)
- 29 quantization formats (K-quants, I-quants)
- NCCL support for PyTorch distributed
- Auto-download binaries from GitHub Releases
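Because the server speaks the OpenAI wire protocol, any OpenAI SDK can point at it. A minimal sketch with the official `openai` Python package, assuming a server already listening on localhost:8080; the model string is a placeholder, since llama-server serves whichever GGUF it loaded:

```python
# Query a running llcuda/llama-server instance with the stock OpenAI SDK.
# base_url and the dummy api_key assume a local, unauthenticated server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="local-gguf",  # placeholder; llama-server accepts any model string
    messages=[{"role": "user", "content": "Hello from Kaggle!"}],
    max_tokens=64,
)
print(reply.choices[0].message.content)
```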
Core Architecture¶
llcuda v2.2.0 implements a unique split-GPU architecture for Kaggle's dual T4 environment:
%%{init: {'theme':'base', 'themeVariables': {'fontSize':'18px'}}}%%
graph TD
A[GGUF Model<br/>HuggingFace] --> B[llcuda Deployment<br/>Kaggle Dual T4]
B --> C[GPU 0: llama-server<br/>LLM Inference]
B --> D[GPU 1: RAPIDS + Graphistry<br/>Analytics & Visualization]
C --> E[OpenAI-Compatible API<br/>:8080]
E --> F[Knowledge Extraction<br/>Entity & Relationships]
F --> D
D --> G[Interactive Dashboards<br/>Graphistry Cloud]
style A fill:#FF9800,stroke:#F57C00,stroke-width:3px,color:#fff
style B fill:#FF5722,stroke:#E64A19,stroke-width:3px,color:#fff
style C fill:#4CAF50,stroke:#388E3C,stroke-width:3px,color:#fff
style D fill:#2196F3,stroke:#1976D2,stroke-width:3px,color:#fff
style E fill:#00BCD4,stroke:#0097A7,stroke-width:3px,color:#fff
style F fill:#00BCD4,stroke:#0097A7,stroke-width:3px,color:#fff
style G fill:#03A9F4,stroke:#0288D1,stroke-width:3px,color:#fff
classDef default font-size:16px,padding:15px
Split-GPU Configuration¶
┌──────────────────────────────────────────────────────────────────────┐
│                KAGGLE DUAL T4 SPLIT-GPU ARCHITECTURE                 │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   GPU 0: Tesla T4 (15GB)                GPU 1: Tesla T4 (15GB)      │
│   ┌────────────────────────┐            ┌────────────────────────┐  │
│   │  llama-server          │            │  RAPIDS cuDF           │  │
│   │  GGUF Model            │ ─────────> │  cuGraph               │  │
│   │  LLM Inference         │  extract   │  Graphistry[ai]        │  │
│   │  ~5-12GB VRAM          │  graphs    │  Network Viz           │  │
│   └────────────────────────┘            └────────────────────────┘  │
│                                                                      │
│   • tensor-split for multi-GPU          • Millions of nodes/edges   │
│   • FlashAttention enabled              • GPU-accelerated rendering │
│   • OpenAI API compatible               • Interactive exploration   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Quick Start (5 Minutes)¶
Get llcuda v2.2.0 running on Kaggle in just 5 minutes!
Step 1: Install llcuda¶
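Assuming the package keeps the same name on PyPI, installation in a Kaggle GPU notebook is a single `pip install llcuda`. The pip package itself is tiny (~62 KB); the 961 MB CUDA 12.5 binaries are fetched separately on first import (see the note below).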
Step 2: Verify Dual T4 Setup¶
import llcuda
from llcuda.api.multigpu import detect_gpus, print_gpu_info
# Check GPU configuration
gpus = detect_gpus()
print(f"✓ Detected {len(gpus)} GPUs")
print_gpu_info()
# Expected output:
# ✓ Detected 2 GPUs
# GPU 0: Tesla T4 (15.0 GB)
# GPU 1: Tesla T4 (15.0 GB)
Step 3: Run Basic Inference¶
from llcuda.server import ServerManager, ServerConfig
# Configure for single GPU (GPU 0)
config = ServerConfig(
model_path="model.gguf", # Your GGUF model
n_gpu_layers=99, # Offload all layers to GPU
flash_attn=True, # Enable FlashAttention
)
# Start server
server = ServerManager()
server.start_with_config(config)
server.wait_until_ready()
# OpenAI-compatible client
from llcuda.api import LlamaCppClient
client = LlamaCppClient("http://localhost:8080")
response = client.chat.completions.create(
messages=[{"role": "user", "content": "Explain quantum computing"}],
max_tokens=200
)
print(response.choices[0].message.content)
Auto-Download Binaries
CUDA binaries (961 MB) download automatically from GitHub Releases v2.2.0 on first import. Cached for future runs!
Full Installation Guide · Kaggle Setup Tutorial
Key Features of v2.2.0¶
1. Multi-GPU Inference on Kaggle¶
Run models up to 70B parameters using both T4 GPUs with native CUDA tensor-split:
from llcuda.api.multigpu import kaggle_t4_dual_config
# Optimized dual T4 configuration
config = kaggle_t4_dual_config(model_size_gb=25) # For 70B IQ3_XS
print(config.to_cli_args())
# Output: ['-ngl', '-1', '--split-mode', 'layer', '--tensor-split', '0.5,0.5', '-fa']
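The split need not be even. If you plan to keep GPU 1 free for Graphistry (see the split-GPU section below), you can pin all layers to GPU 0. A sketch reusing the `ServerConfig` fields shown on this page; the `"1.0,0.0"` value follows llama.cpp's `--tensor-split` semantics, and that `ServerConfig` forwards it unchanged is an assumption:

```python
# Sketch: keep the whole model on GPU 0 so GPU 1 stays free for
# RAPIDS/Graphistry. "1.0,0.0" follows llama.cpp --tensor-split semantics;
# ServerConfig forwarding this string unchanged is an assumption.
from llcuda.server import ServerConfig

config = ServerConfig(
    model_path="model.gguf",
    n_gpu_layers=99,
    tensor_split="1.0,0.0",  # all layers on GPU 0, none on GPU 1
    flash_attn=True,
)
```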
Supported Model Sizes on Dual T4 (30GB VRAM):
| Model Size | Quantization | VRAM Required | Runs On |
|---|---|---|---|
| 1-3B | Q4_K_M | 2-3 GB | ✅ Single T4 |
| 7-8B | Q4_K_M | 5-6 GB | ✅ Single T4 |
| 13B | Q4_K_M | 8-9 GB | ✅ Single T4 |
| 32-34B | Q4_K_M | 20-22 GB | ✅ Dual T4 |
| 70B | IQ3_XS | 25-27 GB | ✅ Dual T4 |
:material-gpu: Multi-GPU Guide
2. Unsloth Fine-Tuning Pipeline¶
Complete workflow from fine-tuning to deployment:
from unsloth import FastLanguageModel
# Load model for training
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-1.5B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj"],
lora_alpha=16,
lora_dropout=0,
)
# Train your model (trainer setup elided; Unsloth examples typically use TRL's SFTTrainer)...
trainer.train()
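Between training and deployment sits the GGUF export mentioned above. Continuing from the training snippet, a minimal sketch using Unsloth's `save_pretrained_gguf()` (directory name and quantization method are illustrative):

```python
# Export the fine-tuned model straight to quantized GGUF via Unsloth.
# The directory and quantization method are illustrative choices.
model.save_pretrained_gguf(
    "my_finetuned_model",
    tokenizer,
    quantization_method="q4_k_m",  # yields the Q4_K_M file deployed below
)
```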
from llcuda.server import ServerManager, ServerConfig
# Deploy on Kaggle dual T4
config = ServerConfig(
model_path="my_finetuned_model-Q4_K_M.gguf",
n_gpu_layers=99,
tensor_split="0.5,0.5", # Use both GPUs
flash_attn=True,
)
server = ServerManager()
server.start_with_config(config)
# Now serving at http://localhost:8080 with OpenAI API
3. Split-GPU Architecture with Graphistry¶
Unique capability: run LLM inference on GPU 0 while using GPU 1 for RAPIDS/Graphistry visualization.
from llcuda.graphistry import SplitGPUManager, GraphWorkload, register_graphistry
# Configure split-GPU setup
manager = SplitGPUManager()
manager.assign_llm(0)
manager.assign_graph(1)
# Graphistry on GPU 1
workload = GraphWorkload(gpu_id=1)
register_graphistry(api=3, protocol="https", server="hub.graphistry.com")
# Run LLM on GPU 0 and visualize graphs on GPU 1
Use Cases:

- Extract knowledge graphs from LLM outputs → visualize with Graphistry (sketched below)
- Analyze entity relationships in generated text
- Interactive exploration of LLM-generated networks
- Real-time graph updates from streaming LLM responses
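A minimal sketch of the extract-then-visualize loop using the public PyGraphistry API (`register`/`edges`/`bind`/`plot`); the triples below stand in for whatever your extraction prompt actually returns and are not produced by any llcuda API:

```python
# Visualize LLM-extracted subject-relation-object triples with PyGraphistry.
import pandas as pd
import graphistry

graphistry.register(api=3, protocol="https", server="hub.graphistry.com",
                    username="YOUR_USER", password="YOUR_PASS")

# Stand-in for triples parsed from an LLM response generated on GPU 0
triples = [
    ("Alan Turing", "worked_at", "Bletchley Park"),
    ("Alan Turing", "proposed", "Turing machine"),
    ("Bletchley Park", "located_in", "England"),
]
edges = pd.DataFrame(triples, columns=["src", "rel", "dst"])

# Bind endpoints and render an interactive network on Graphistry Hub
g = graphistry.edges(edges, "src", "dst").bind(edge_title="rel")
g.plot()
```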
4. 29 GGUF Quantization Formats¶
llcuda supports all llama.cpp quantization types (a rough VRAM-sizing sketch follows the lists below):
K-quants - best quality-to-size ratio with double quantization:
- Q4_K_M - 4.8 bpw, best for most models (recommended)
- Q5_K_M - 5.7 bpw, higher quality
- Q6_K - 6.6 bpw, near FP16 quality
- Q8_0 - 8.5 bpw, very high quality
I-quants - importance-matrix quantization for 70B models:
- IQ3_XS - 3.3 bpw, fits 70B on dual T4
- IQ4_XS - 4.3 bpw, better quality
- IQ2_XS - 2.3 bpw, extreme compression
- IQ1_S - 1.6 bpw, smallest possible
Legacy quants - standard quantization types:
- Q4_0 - 4.5 bpw, legacy format
- Q5_0 - 5.5 bpw, legacy format
- Q8_0 - 8.5 bpw, high precision
Unquantized formats:
- F32 - 32-bit float
- F16 - 16-bit float
- BF16 - Brain float 16
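As a rough sizing rule tying the bpw figures above to the VRAM table earlier: the weight footprint is parameters × bits-per-weight ÷ 8, and KV cache plus CUDA buffers come on top. A sketch in plain Python (no llcuda APIs; these are coarse upper bounds, and mixed-precision I-quants can land a little lower, as in the 25-27 GB quoted for 70B IQ3_XS):

```python
# Coarse weight footprint in GB: params (billions) * bits-per-weight / 8.
# Runtime VRAM adds KV cache, activations, and CUDA buffers on top.
def weights_gb(params_b: float, bpw: float) -> float:
    return params_b * bpw / 8

for params_b, name, bpw in [(7, "Q4_K_M", 4.8), (70, "IQ3_XS", 3.3), (70, "Q4_K_M", 4.8)]:
    verdict = "fits" if weights_gb(params_b, bpw) < 30 else "does not fit"
    print(f"{params_b}B @ {name}: ~{weights_gb(params_b, bpw):.1f} GB weights ({verdict} in 30 GB)")
# 7B @ Q4_K_M: ~4.2 GB weights (fits in 30 GB)
# 70B @ IQ3_XS: ~28.9 GB weights (fits in 30 GB)
# 70B @ Q4_K_M: ~42.0 GB weights (does not fit in 30 GB)
```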
Performance Benchmarks¶
Real Kaggle dual T4 performance metrics:
| Model | Quantization | GPUs | Tokens/sec | Latency | VRAM Usage |
|---|---|---|---|---|---|
| Gemma 2-2B | Q4_K_M | 2× T4 | ~60 tok/s | - | 4 GB |
| Qwen2.5-7B | Q4_K_M | 2× T4 | ~35 tok/s | - | 10 GB |
| Llama-3.1-70B | IQ3_XS | 2× T4 | ~8-12 tok/s | - | 27 GB |
| Gemma 3-1B | Q4_K_M | 1× T4 | ~45 tok/s | 690ms | 3 GB |
Performance Optimization
- Enable FlashAttention for 2-3x speedup
- Use tensor-split for models >15GB
- K-quants provide best quality/speed balance
- I-quants enable 70B models on 30GB VRAM
Tutorial Notebooks (13 Kaggle Notebooks)¶
Complete tutorial series for Kaggle dual T4 environment - from beginner to advanced visualization.
Core Tutorials (1-10)¶
| # | Notebook | Description | Time |
|---|---|---|---|
| 01 | Quick Start | 5-minute introduction | 5 min |
| 02 | Server Setup | Server configuration & lifecycle | 15 min |
| 03 | Multi-GPU | Dual T4 tensor-split | 20 min |
| 04 | GGUF Quantization | K-quants, I-quants, parsing | 20 min |
| 05 | Unsloth Integration | Fine-tune → GGUF → Deploy | 30 min |
| 06 | Split-GPU + Graphistry | LLM + RAPIDS visualization | 30 min |
| 07 | Knowledge Graph Extraction | LLM-driven entity & relation graphs | 30 min |
| 08 | Document Network Analysis | GPU graph analytics for documents | 35 min |
| 09 | Large Models (13B+) | Large models on dual T4 | 30 min |
| 10 | Complete Workflow | End-to-end production | 45 min |
Visualization Trilogy (11-13)¶
| # | Notebook | Description | Time |
|---|---|---|---|
| 11 | GGUF Neural Network Visualization | Complete architecture → dashboards | 60 min |
| 12 | GGUF Attention Mechanism Explorer | Q-K-V attention analysis | 20 min |
| 13 | GGUF Token Embedding Visualizer | 3D embedding space exploration | 15 min |
View All Tutorials · Visualization Trilogy
Learning Paths¶
Choose your path based on experience level:
Beginner path: perfect for first-time users. Learn the basics of llcuda on Kaggle.
What's New in v2.2.0¶
Major Release Highlights
Positioned as Unsloth Inference Backend
- llcuda is now the official CUDA12 inference backend for Unsloth
- Seamless workflow: Unsloth (training) → llcuda (inference)
- Complete Kaggle dual T4 build notebook included
New Features:
- Kaggle Dual T4 Build - Complete build notebook for reproducible binaries
- Split-GPU Architecture - LLM on GPU 0 + Graphistry on GPU 1
- Multi-GPU Clarification - Native CUDA tensor-split (NOT NCCL)
- 961MB Binary Package - Pre-built CUDA 12.5 binaries for T4
- Graphistry Integration - PyGraphistry for knowledge graph visualization
- 70B Model Support - IQ3_XS quantization for large models
- FlashAttention All Quants - Enabled for all quantization types
Performance:
| Platform | GPU | Model | Tokens/sec |
|---|---|---|---|
| Kaggle | 2× T4 | Gemma 2-2B Q4_K_M | ~60 tok/s |
| Kaggle | 2× T4 | Llama 70B IQ3_XS | ~12 tok/s |
Technical Architecture¶
llcuda v2.2.0 is built on proven technologies:
- llama.cpp Server
    - Build 7760 (commit 388ce82)
    - OpenAI-compatible API
    - Native CUDA tensor-split
    - FlashAttention support
- CUDA 12.5
    - SM 7.5 (Turing) targeting
    - cuBLAS acceleration
    - Static linking
    - 961MB binary package
- Python 3.11+
    - Type-safe APIs
    - Async/await support
    - Modern packaging
    - 62KB pip package
- RAPIDS + Graphistry
    - cuDF for GPU DataFrames
    - cuGraph for network analysis
    - PyGraphistry visualization
    - Millions of nodes/edges
API Reference¶
llcuda provides comprehensive Python APIs:
| Module | Description |
|---|---|
| llcuda.api.client | OpenAI-compatible llama.cpp client |
| llcuda.api.multigpu | Multi-GPU configuration for Kaggle |
| llcuda.api.gguf | GGUF parsing and quantization tools |
| llcuda.api.nccl | NCCL for distributed PyTorch |
| llcuda.server | Server lifecycle management |
| llcuda.graphistry | Graphistry integration helpers |
Community & Support¶
- GitHub Repository: github.com/llcuda/llcuda
- GitHub Releases: v2.2.0 Download
- Bug Reports: GitHub Issues
- Email: waqasm86@gmail.com
License¶
MIT License - Free for commercial and personal use. See LICENSE.