llcuda v2.2.0: CUDA12 Inference Backend for Unsloth

Version 2.2.0 Python 3.11+ CUDA 12.x Kaggle 2×T4 MIT License

CUDA12-first inference backend for Unsloth with Graphistry network visualization on Kaggle dual Tesla T4 GPUs. Fine-tune with Unsloth → Export to GGUF → Deploy on Kaggle → Visualize with Graphistry.


🚀 What is llcuda v2.2.0?

llcuda is a CUDA 12 inference backend specifically designed for deploying Unsloth-fine-tuned models on Kaggle's dual Tesla T4 GPUs (30GB total VRAM). It provides:

:material-gpu: Dual T4 Architecture

Run on Kaggle's 2× Tesla T4 GPUs (15GB each)

  • Native CUDA tensor-split for multi-GPU
  • Support for 70B models with IQ3_XS quantization
  • FlashAttention for 2-3x faster inference
  • 961MB pre-built CUDA 12.5 binaries

Split-GPU Design

Unique architecture: LLM on GPU 0 + Graphistry on GPU 1

  • GPU 0: llama.cpp server for LLM inference
  • GPU 1: RAPIDS cuGraph + Graphistry visualization
  • Extract knowledge graphs from LLM outputs
  • Visualize millions of nodes and edges

Unsloth Integration

Seamless workflow from training to deployment

  • Fine-tune with Unsloth (2x faster training)
  • Export to GGUF format with save_pretrained_gguf()
  • Deploy with llcuda on Kaggle
  • Complete end-to-end pipeline

Production Ready

Built for Kaggle production workloads

  • OpenAI-compatible API via llama-server
  • 29 quantization formats (K-quants, I-quants)
  • NCCL support for PyTorch distributed
  • Auto-download binaries from GitHub Releases

🔥 Core Architecture

llcuda v2.2.0 implements a unique split-GPU architecture for Kaggle's dual T4 environment:

%%{init: {'theme':'base', 'themeVariables': {'fontSize':'18px'}}}%%
graph TD
    A[GGUF Model<br/>HuggingFace] --> B[llcuda Deployment<br/>Kaggle Dual T4]
    B --> C[GPU 0: llama-server<br/>LLM Inference]
    B --> D[GPU 1: RAPIDS + Graphistry<br/>Analytics & Visualization]
    C --> E[OpenAI-Compatible API<br/>:8090]
    E --> F[Knowledge Extraction<br/>Entity & Relationships]
    F --> D
    D --> G[Interactive Dashboards<br/>Graphistry Cloud]

    style A fill:#FF9800,stroke:#F57C00,stroke-width:3px,color:#fff
    style B fill:#FF5722,stroke:#E64A19,stroke-width:3px,color:#fff
    style C fill:#4CAF50,stroke:#388E3C,stroke-width:3px,color:#fff
    style D fill:#2196F3,stroke:#1976D2,stroke-width:3px,color:#fff
    style E fill:#00BCD4,stroke:#0097A7,stroke-width:3px,color:#fff
    style F fill:#00BCD4,stroke:#0097A7,stroke-width:3px,color:#fff
    style G fill:#03A9F4,stroke:#0288D1,stroke-width:3px,color:#fff

    classDef default font-size:16px,padding:15px

Split-GPU Configuration

┌────────────────────────────────────────────────────────────────────────────┐
│                    KAGGLE DUAL T4 SPLIT-GPU ARCHITECTURE                   │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│     GPU 0: Tesla T4 (15GB)                  GPU 1: Tesla T4 (15GB)         │
│     ┌────────────────────────┐              ┌────────────────────────┐     │
│     │                        │              │                        │     │
│     │   llama-server         │              │   RAPIDS cuDF          │     │
│     │   GGUF Model           │  ─────────>  │   cuGraph              │     │
│     │   LLM Inference        │   extract    │   Graphistry[ai]       │     │
│     │   ~5-12GB VRAM         │   graphs     │   Network Viz          │     │
│     │                        │              │                        │     │
│     └────────────────────────┘              └────────────────────────┘     │
│                                                                            │
│     • tensor-split for multi-GPU          • Millions of nodes/edges        │
│     • FlashAttention enabled              • GPU-accelerated rendering      │
│     • OpenAI API compatible               • Interactive exploration        │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

⚡ Quick Start (5 Minutes)

Get llcuda v2.2.0 running on Kaggle in just 5 minutes!

Step 1: Install llcuda

# On Kaggle notebook
pip install git+https://github.com/llcuda/llcuda.git@v2.2.0

Step 2: Verify Dual T4 Setup

import llcuda
from llcuda.api.multigpu import detect_gpus, print_gpu_info

# Check GPU configuration
gpus = detect_gpus()
print(f"✓ Detected {len(gpus)} GPUs")
print_gpu_info()

# Expected output:
# ✓ Detected 2 GPUs
# GPU 0: Tesla T4 (15.0 GB)
# GPU 1: Tesla T4 (15.0 GB)

Step 3: Run Basic Inference

from llcuda.server import ServerManager, ServerConfig

# Configure for single GPU (GPU 0)
config = ServerConfig(
    model_path="model.gguf",  # Your GGUF model
    n_gpu_layers=99,          # Offload all layers to GPU
    flash_attn=True,          # Enable FlashAttention
)

# Start server
server = ServerManager()
server.start_with_config(config)
server.wait_until_ready()

# OpenAI-compatible client
from llcuda.api import LlamaCppClient

client = LlamaCppClient("http://localhost:8080")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=200
)

print(response.choices[0].message.content)

Alternatively, use the high-level InferenceEngine API:

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)

result = engine.infer("Explain quantum computing", max_tokens=200)
print(result.text)

Auto-Download Binaries

CUDA binaries (961 MB) download automatically from GitHub Releases v2.2.0 on first import. Cached for future runs!

Full Installation Guide Kaggle Setup Tutorial


⭐ Key Features of v2.2.0

1. Multi-GPU Inference on Kaggle

Run models up to 70B parameters using both T4 GPUs with native CUDA tensor-split:

from llcuda.api.multigpu import kaggle_t4_dual_config

# Optimized dual T4 configuration
config = kaggle_t4_dual_config(model_size_gb=25)  # For 70B IQ3_XS

print(config.to_cli_args())
# Output: ['-ngl', '-1', '--split-mode', 'layer', '--tensor-split', '0.5,0.5', '-fa']
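The `--tensor-split 0.5,0.5` fractions simply reflect how much of each GPU's VRAM should hold model layers. For the symmetric Kaggle T4s an even split is right, but the same idea generalizes; a minimal sketch (hypothetical helper, not part of llcuda's API):

```python
def tensor_split(free_gib):
    """Normalize per-GPU free VRAM into --tensor-split fractions."""
    total = sum(free_gib)
    return [round(g / total, 2) for g in free_gib]

# Kaggle's dual T4s expose ~15 GiB each, giving the even split
# that kaggle_t4_dual_config() emits:
print(tensor_split([15.0, 15.0]))   # [0.5, 0.5]

# An asymmetric example (e.g. GPU 0 carries extra framework overhead):
print(tensor_split([12.0, 15.0]))
```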

Supported Model Sizes on Dual T4 (30GB VRAM):

| Model Size | Quantization | VRAM Required | Fits Dual T4? |
|---|---|---|---|
| 1-3B | Q4_K_M | 2-3 GB | ✅ Single T4 |
| 7-8B | Q4_K_M | 5-6 GB | ✅ Single T4 |
| 13B | Q4_K_M | 8-9 GB | ✅ Single T4 |
| 32-34B | Q4_K_M | 20-22 GB | ✅ Dual T4 |
| 70B | IQ3_XS | 25-27 GB | ✅ Dual T4 |
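The VRAM figures above follow mostly from quantized weight size: parameter count × bits-per-weight ÷ 8, with KV cache and CUDA context overhead on top. A rough estimator (illustrative only, not an llcuda function):

```python
def weight_size_gib(params_b, bpw):
    """Approximate GGUF weight size in GiB: params (billions) x bits-per-weight / 8."""
    return params_b * 1e9 * bpw / 8 / 2**30

# 70B at IQ3_XS (3.3 bpw) -> ~26.9 GiB of weights, which is why it
# needs both T4s and leaves little headroom for the KV cache.
print(round(weight_size_gib(70, 3.3), 1))   # 26.9

# 7B at Q4_K_M (4.8 bpw) sits comfortably on a single T4.
print(round(weight_size_gib(7, 4.8), 1))
```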

:material-gpu: Multi-GPU Guide


2. Unsloth Fine-Tuning Pipeline

Complete workflow from fine-tuning to deployment:

from unsloth import FastLanguageModel

# Load model for training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# Train your model (e.g. with a TRL SFTTrainer wrapping this model)...
trainer.train()
# Export fine-tuned model to GGUF format
model.save_pretrained_gguf(
    "my_finetuned_model",
    tokenizer,
    quantization_method="q4_k_m"  # Recommended for T4
)

# Output: my_finetuned_model-Q4_K_M.gguf

Then deploy the exported GGUF with llcuda:

from llcuda.server import ServerManager, ServerConfig

# Deploy on Kaggle dual T4
config = ServerConfig(
    model_path="my_finetuned_model-Q4_K_M.gguf",
    n_gpu_layers=99,
    tensor_split="0.5,0.5",  # Use both GPUs
    flash_attn=True,
)

server = ServerManager()
server.start_with_config(config)

# Now serving at http://localhost:8080 with OpenAI API

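Because llama-server speaks the OpenAI wire protocol, any HTTP client can talk to the deployed model; the sketch below builds a standard `/v1/chat/completions` request body with only the stdlib (the `chat_payload` helper is hypothetical, and the actual POST is commented out since it needs the server running):

```python
import json
import urllib.request

def chat_payload(prompt, max_tokens=200):
    """Build an OpenAI-format /v1/chat/completions request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_payload("Summarize this fine-tuned model's domain.")
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)   # requires the server started above
# print(json.load(resp)["choices"][0]["message"]["content"])
```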
Unsloth Integration Guide


3. Split-GPU Architecture with Graphistry

Unique capability: Run LLM inference on GPU 0 while using GPU 1 for RAPIDS/Graphistry visualization

from llcuda.graphistry import SplitGPUManager, GraphWorkload, register_graphistry

# Configure split-GPU setup
manager = SplitGPUManager()
manager.assign_llm(0)
manager.assign_graph(1)

# Graphistry on GPU 1
workload = GraphWorkload(gpu_id=1)
register_graphistry(api=3, protocol="https", server="hub.graphistry.com")

# Run LLM on GPU 0 and visualize graphs on GPU 1

Use Cases:

  • Extract knowledge graphs from LLM outputs → visualize with Graphistry
  • Analyze entity relationships in generated text
  • Interactive exploration of LLM-generated networks
  • Real-time graph updates from streaming LLM responses
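One simple way to bridge the two GPUs is to prompt the LLM for `(subject, relation, object)` triples and parse them into a Graphistry-ready edge table. A minimal sketch, assuming that prompt format (the parser is hypothetical; the plot call is commented out since it needs Graphistry credentials):

```python
import re

TRIPLE = re.compile(r"\((.+?),\s*(.+?),\s*(.+?)\)")

def extract_edges(llm_text):
    """Parse '(subject, relation, object)' triples from LLM output."""
    return [{"src": s, "rel": r, "dst": d}
            for s, r, d in TRIPLE.findall(llm_text)]

llm_output = """
(CUDA, accelerates, llama.cpp)
(llama.cpp, serves, GGUF models)
"""
edges = extract_edges(llm_output)
print(edges[0])   # {'src': 'CUDA', 'rel': 'accelerates', 'dst': 'llama.cpp'}

# Hand the edge list to Graphistry on GPU 1 (after register_graphistry):
# import pandas as pd, graphistry
# graphistry.edges(pd.DataFrame(edges), "src", "dst").plot()
```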

Graphistry Integration Guide


4. 29 GGUF Quantization Formats

llcuda supports all llama.cpp quantization types:

K-quants - best quality-to-size ratio with double quantization:

  • Q4_K_M - 4.8 bpw, best for most models (recommended)
  • Q5_K_M - 5.7 bpw, higher quality
  • Q6_K - 6.6 bpw, near FP16 quality
  • Q8_0 - 8.5 bpw, very high quality

I-quants - importance-matrix quantization for 70B models:

  • IQ3_XS - 3.3 bpw, fits 70B on dual T4
  • IQ4_XS - 4.3 bpw, better quality
  • IQ2_XS - 2.3 bpw, extreme compression
  • IQ1_S - 1.6 bpw, smallest possible

Standard quantization types:

  • Q4_0 - 4.5 bpw, legacy format
  • Q5_0 - 5.5 bpw, legacy format
  • Q8_0 - 8.5 bpw, high precision

Unquantized formats:

  • F32 - 32-bit float
  • F16 - 16-bit float
  • BF16 - Brain float 16
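The bits-per-weight figures above give a quick way to pick the highest-precision format that fits a VRAM budget (weights only; KV cache comes on top). An illustrative chooser using the bpw numbers quoted in this section:

```python
# Bits-per-weight figures quoted above
BPW = {"IQ1_S": 1.6, "IQ2_XS": 2.3, "IQ3_XS": 3.3, "IQ4_XS": 4.3,
       "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def best_quant(params_b, budget_gib):
    """Highest-precision quant whose weights fit the VRAM budget (GiB)."""
    fits = [(bpw, name) for name, bpw in BPW.items()
            if params_b * 1e9 * bpw / 8 / 2**30 <= budget_gib]
    return max(fits)[1] if fits else None

# A 70B model into the dual T4s' ~27 GiB of usable VRAM:
print(best_quant(70, 27))   # IQ3_XS
```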

GGUF Quantization Guide


📈 Performance Benchmarks

Real Kaggle dual T4 performance metrics:

| Model | Quantization | GPUs | Tokens/sec | Latency | VRAM Usage |
|---|---|---|---|---|---|
| Gemma 2-2B | Q4_K_M | 2× T4 | ~60 tok/s | - | 4 GB |
| Qwen2.5-7B | Q4_K_M | 2× T4 | ~35 tok/s | - | 10 GB |
| Llama-3.1-70B | IQ3_XS | 2× T4 | ~8-12 tok/s | - | 27 GB |
| Gemma 3-1B | Q4_K_M | 1× T4 | ~45 tok/s | 690ms | 3 GB |

Performance Optimization

  • Enable FlashAttention for 2-3x speedup
  • Use tensor-split for models >15GB
  • K-quants provide best quality/speed balance
  • I-quants enable 70B models on 30GB VRAM

Full Benchmarks


📓 Tutorial Notebooks (13 Kaggle Notebooks)

Complete tutorial series for Kaggle dual T4 environment - from beginner to advanced visualization.

Core Tutorials (1-10)

| # | Notebook | Open in Kaggle | Description | Time |
|---|---|---|---|---|
| 01 | Quick Start | Kaggle | 5-minute introduction | 5 min |
| 02 | Server Setup | Kaggle | Server configuration & lifecycle | 15 min |
| 03 | Multi-GPU | Kaggle | Dual T4 tensor-split | 20 min |
| 04 | GGUF Quantization | Kaggle | K-quants, I-quants, parsing | 20 min |
| 05 | Unsloth Integration | Kaggle | Fine-tune → GGUF → Deploy | 30 min |
| 06 | Split-GPU + Graphistry | Kaggle | LLM + RAPIDS visualization | 30 min |
| 07 | Knowledge Graph Extraction | Kaggle | LLM-driven entity & relation graphs | 30 min |
| 08 | Document Network Analysis | Kaggle | GPU graph analytics for documents | 35 min |
| 09 | Large Models (13B+) | Kaggle | Large models on dual T4 | 30 min |
| 10 | Complete Workflow | Kaggle | End-to-end production | 45 min |

Visualization Trilogy (11-13)

| # | Notebook | Open in Kaggle | Description | Time |
|---|---|---|---|---|
| 11 | GGUF Neural Network Visualization | Kaggle | Complete architecture → dashboards | 60 min |
| 12 | GGUF Attention Mechanism Explorer | Kaggle | Q‑K‑V attention analysis | 20 min |
| 13 | GGUF Token Embedding Visualizer | Kaggle | 3D embedding space exploration | 15 min |

View All Tutorials ⭐ Visualization Trilogy


📚 Learning Paths

Choose your path based on experience level:

01 Quick Start → 02 Server Setup → 03 Multi-GPU

Perfect for first-time users. Learn the basics of llcuda on Kaggle.

01 → 02 → 03 → 04 → 05 → 06 → 07 → 10

Complete fundamentals through advanced workflows.

01 → 03 → 08 → 09

Focus on multi-GPU and large model deployment.

01 → 04 → 05 → 10

Complete Unsloth fine-tuning and deployment workflow.

01 → 03 → 04 → 06 → 11 → 12 → 13

RECOMMENDED - Learn architecture visualization with Graphistry.



✨ What's New in v2.2.0

Major Release Highlights

Positioned as Unsloth Inference Backend

  • llcuda is now the official CUDA12 inference backend for Unsloth
  • Seamless workflow: Unsloth (training) → llcuda (inference)
  • Complete Kaggle dual T4 build notebook included

New Features:

  • Kaggle Dual T4 Build - Complete build notebook for reproducible binaries
  • Split-GPU Architecture - LLM on GPU 0 + Graphistry on GPU 1
  • Multi-GPU Clarification - Native CUDA tensor-split (NOT NCCL)
  • 961MB Binary Package - Pre-built CUDA 12.5 binaries for T4
  • Graphistry Integration - PyGraphistry for knowledge graph visualization
  • 70B Model Support - IQ3_XS quantization for large models
  • FlashAttention All Quants - Enabled for all quantization types

Performance:

| Platform | GPU | Model | Tokens/sec |
|---|---|---|---|
| Kaggle | 2× T4 | Gemma 2-2B Q4_K_M | ~60 tok/s |
| Kaggle | 2× T4 | Llama 70B IQ3_XS | ~12 tok/s |

Full Changelog


⚙ Technical Architecture

llcuda v2.2.0 is built on proven technologies:

  • llama.cpp Server


    • Build 7760 (commit 388ce82)
    • OpenAI-compatible API
    • Native CUDA tensor-split
    • FlashAttention support
  • CUDA 12.5


    • SM 7.5 (Turing) targeting
    • cuBLAS acceleration
    • Static linking
    • 961MB binary package
  • Python 3.11+


    • Type-safe APIs
    • Async/await support
    • Modern packaging
    • 62KB pip package
  • RAPIDS + Graphistry


    • cuDF for GPU DataFrames
    • cuGraph for network analysis
    • PyGraphistry visualization
    • Millions of nodes/edges

API Reference

llcuda provides comprehensive Python APIs:

| Module | Description |
|---|---|
| llcuda.api.client | OpenAI-compatible llama.cpp client |
| llcuda.api.multigpu | Multi-GPU configuration for Kaggle |
| llcuda.api.gguf | GGUF parsing and quantization tools |
| llcuda.api.nccl | NCCL for distributed PyTorch |
| llcuda.server | Server lifecycle management |
| llcuda.graphistry | Graphistry integration helpers |

Full API Documentation


🤝 Community & Support


⚖ License

MIT License - Free for commercial and personal use. See LICENSE.


Built with ❤️ by Waqas Muhammad | Powered by llama.cpp | Optimized for Unsloth & Graphistry