# llcuda - CUDA 12 Inference Backend for Unsloth

> llcuda v2.2.0: CUDA 12 inference backend for Unsloth with multi-GPU support on Kaggle dual Tesla T4 (30 GB VRAM). Split-GPU architecture: GPU 0 for LLM inference, GPU 1 for Graphistry visualization. Optimized for small GGUF models (1B–5B) on Kaggle dual T4.

## About llcuda

llcuda is a CUDA 12 inference backend designed specifically for Unsloth GGUF models on Kaggle's dual Tesla T4 GPU environment. It implements a split-GPU architecture: GPU 0 handles LLM inference through llama-server, while GPU 1 runs RAPIDS/Graphistry for knowledge-graph visualization with millions of nodes and edges.

Key Features:

- 🚀 **Dual T4 GPU Support**: 2× Tesla T4 (15 GB each, 30 GB total VRAM)
- 🔥 **Split-GPU Architecture**: LLM on GPU 0, Graphistry on GPU 1
- ⚡ **Native CUDA tensor-split**: llama.cpp layer distribution (NOT NCCL)
- 🎯 **961 MB Binary Package**: llama.cpp build 7760 with FlashAttention
- 🔧 **70B Model Support**: Run Llama-70B IQ3_XS on dual T4
- 📦 **29 GGUF Quantization Formats**: K-quants and I-quants
- 🚀 **Kaggle Optimized**: 13 tutorial notebooks for dual T4
- 🔄 **Unsloth Integration**: Fine-tune → GGUF export → llcuda deployment
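Conceptually, the split-GPU layout described above is process-level GPU pinning. The sketch below is a minimal illustration of that technique, not llcuda's actual API: it assumes a `llama-server` binary on the PATH and a local GGUF file (both placeholders), and uses `CUDA_VISIBLE_DEVICES` to confine each workload to one T4.

```python
import os
import subprocess

# GPU 0: llama-server sees only the first T4. Model path and port are
# placeholders; llcuda's ServerManager normally handles this launch.
llm = subprocess.Popen(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
)

# GPU 1: pin the current process to the second T4 *before* importing
# RAPIDS, so cuDF/cuGraph/Graphistry allocations land there.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import cudf  # noqa: E402 -- deliberate post-pinning import
```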
## Getting Started

- [Homepage](https://llcuda.github.io/): Official llcuda v2.2.0 documentation
- [Installation Guide](https://llcuda.github.io/guides/installation/): Complete setup for Kaggle
- [Quick Start Guide](https://llcuda.github.io/guides/quickstart/): 5-minute guide
- [GitHub Repository](https://github.com/llcuda/llcuda): Main llcuda project

## Comprehensive Tutorials (Kaggle Notebooks)

- [01 - Quick Start](https://llcuda.github.io/tutorials/01-quickstart/): 5-minute introduction to llcuda v2.2.0
- [02 - Server Setup](https://llcuda.github.io/tutorials/02-server-setup/): llama-server configuration and lifecycle
- [03 - Multi-GPU Inference](https://llcuda.github.io/tutorials/03-multi-gpu/): Dual T4 tensor-split configuration
- [04 - GGUF Quantization](https://llcuda.github.io/tutorials/04-gguf-quantization/): K-quants and I-quants explained
- [05 - Unsloth Integration](https://llcuda.github.io/tutorials/05-unsloth-integration/): Fine-tune to deployment workflow
- [06 - Split-GPU Graphistry](https://llcuda.github.io/tutorials/06-split-gpu-graphistry/): LLM + Graphistry on separate GPUs
- [07 - Knowledge Graph Extraction](https://llcuda.github.io/tutorials/07-openai-api/): LLM-driven entity & relation graphs
- [08 - Document Network Analysis](https://llcuda.github.io/tutorials/08-nccl-pytorch/): GPU graph analytics for documents
- [09 - Large Models](https://llcuda.github.io/tutorials/09-large-models/): Run 70B models on dual T4
- [10 - Complete Workflow](https://llcuda.github.io/tutorials/10-complete-workflow/): End-to-end production workflow
- [11 - GGUF Neural Network Visualization](https://llcuda.github.io/tutorials/11-gguf-neural-network-visualization/): Full architecture dashboards

## Main Repository

- [llcuda GitHub Repository](https://github.com/llcuda/llcuda): Main llcuda v2.2.0 source code
- [Binary Releases](https://github.com/llcuda/llcuda/releases/tag/v2.2.0): 961 MB CUDA 12 binaries
- [Kaggle Notebooks](https://github.com/llcuda/llcuda/tree/main/notebooks): 13 tutorial notebooks

## Complete Documentation

### Getting Started Guides

- [Quick Start Guide](https://llcuda.github.io/guides/quickstart/): 5-minute setup
- [Installation Guide](https://llcuda.github.io/guides/installation/): Kaggle setup instructions
- [Kaggle Setup](https://llcuda.github.io/guides/kaggle-setup/): Dual T4 configuration
- [First Steps](https://llcuda.github.io/guides/first-steps/): GPU verification and basic usage
- [Troubleshooting](https://llcuda.github.io/guides/troubleshooting/): Common issues and solutions
- [FAQ](https://llcuda.github.io/guides/faq/): Frequently asked questions

### Advanced Guides

- [Model Selection Guide](https://llcuda.github.io/guides/model-selection/): Choose models for dual T4
- [Build from Source](https://llcuda.github.io/guides/build-from-source/): Compile llama.cpp binaries

### API Reference

- [API Overview](https://llcuda.github.io/api/overview/): Complete API documentation
- [ServerManager API](https://llcuda.github.io/api/server/): Server lifecycle management
- [MultiGPU API](https://llcuda.github.io/api/multigpu/): Dual GPU configuration
- [LlamaCppClient API](https://llcuda.github.io/api/client/): OpenAI-compatible client
- [GGUF Tools API](https://llcuda.github.io/api/gguf/): GGUF parsing and utilities
- [Code Examples](https://llcuda.github.io/api/examples/): Working examples

### Kaggle Documentation

- [Dual GPU Setup](https://llcuda.github.io/kaggle/dual-gpu-setup/): Configure dual T4
- [Multi-GPU Inference](https://llcuda.github.io/kaggle/multi-gpu-inference/): Tensor-split usage
- [Tensor-Split Guide](https://llcuda.github.io/kaggle/tensor-split/): Layer distribution
- [Large Models](https://llcuda.github.io/kaggle/large-models/): 70B on dual T4

### Performance Documentation

- [Performance Benchmarks](https://llcuda.github.io/performance/benchmarks/): Dual T4 results
- [Dual T4 Results](https://llcuda.github.io/performance/dual-t4-results/): Verified performance
- [FlashAttention](https://llcuda.github.io/performance/flash-attention/): 2-3x speedup
- [Memory Management](https://llcuda.github.io/performance/memory/): VRAM optimization
- [Optimization Guide](https://llcuda.github.io/performance/optimization/): Advanced tuning

## Installation

Install llcuda v2.2.0 on Kaggle:

```bash
pip install llcuda
```

Binary auto-download: the 961 MB CUDA binaries download automatically on first use.
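Once the binaries are in place and a llama-server instance is running, requests go through the server's OpenAI-compatible HTTP API. A minimal sketch, assuming a server already listening on localhost:8080; the model name and prompt are placeholders, and llcuda's own LlamaCppClient (documented above) is the integrated alternative:

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint; no real API key is
# needed locally, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

resp = client.chat.completions.create(
    model="gemma-2-2b-q4_k_m",  # placeholder; the server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Summarize what GGUF quantization is."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```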
## Key Features

- **Dual T4 GPU Support**: 30 GB total VRAM (15 GB × 2)
- **Split-GPU Architecture**: LLM on GPU 0, Graphistry on GPU 1
- **Native tensor-split**: llama.cpp layer distribution
- **70B Model Support**: Run Llama-70B IQ3_XS on 30 GB VRAM
- **29 GGUF Quantization Formats**: K-quants and I-quants
- **FlashAttention**: 2-3x speedup for all quantization types
- **Kaggle Optimized**: 13 comprehensive tutorial notebooks

## Performance Data

Verified Kaggle dual T4 performance:

- **Gemma 2-2B Q4_K_M**: ~60 tokens/sec
- **Llama-3.2-3B Q4_K_M**: ~45 tokens/sec
- **Qwen-2.5-7B Q4_K_M**: ~25 tokens/sec
- **Llama-70B IQ3_XS**: ~12 tokens/sec (dual T4)

## Split-GPU Architecture

The llcuda v2.2.0 split-GPU architecture:

```
GPU 0: llama-server (Unsloth GGUF LLM)
        ↓ Extract knowledge graphs from LLM outputs
        ↓
GPU 1: RAPIDS cuDF/cuGraph + Graphistry
        → Visualize millions of nodes/edges
```
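For the large-model path (e.g. Llama-70B IQ3_XS), the alternative to this layout is to give llama-server both T4s and spread layers across them with llama.cpp's native tensor-split, with no NCCL involved. A minimal launch sketch; the model path and the 50/50 split ratio are illustrative, and the MultiGPU API documented above covers the equivalent llcuda-side configuration:

```python
import subprocess

# Distribute layers across both T4s via llama.cpp's native tensor-split.
# Flags are standard llama-server options; exact spellings can vary by build.
subprocess.Popen([
    "llama-server",
    "-m", "Llama-70B-IQ3_XS.gguf",   # placeholder path to the GGUF file
    "--n-gpu-layers", "99",          # offload every layer to the GPUs
    "--tensor-split", "0.5,0.5",     # half the layers on each T4
    "--flash-attn",                  # FlashAttention (flag form varies by build)
    "--port", "8080",
])
```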
## Documentation Structure

- [Home](https://llcuda.github.io/): Main homepage
- [Quick Start](https://llcuda.github.io/guides/quickstart/): 5-minute setup
- [Installation](https://llcuda.github.io/guides/installation/): Kaggle installation
- [Tutorial Index](https://llcuda.github.io/tutorials/index/): 13 Kaggle notebooks
- [API Reference](https://llcuda.github.io/api/overview/): Complete API docs

## Key Information

**llcuda v2.2.0** is a CUDA 12 inference backend for Unsloth:

- **Binary Package**: 961 MB with llama.cpp build 7760
- **Target Platform**: Kaggle dual Tesla T4 (30 GB VRAM)
- **Verified Performance**: 60 tok/s on Gemma 2-2B, 12 tok/s on Llama-70B
- **CUDA Version**: CUDA 12.x
- **Optimizations**: FlashAttention, native tensor-split
- **Format**: GGUF models (29 quantization types)

Remember:

- llcuda v2.2.0 targets Kaggle dual Tesla T4 GPUs
- Install via `pip install llcuda`
- Binaries auto-download on first use (961 MB)
- Split-GPU: LLM on GPU 0, Graphistry on GPU 1
- 70B models are supported with IQ3_XS quantization
- 13 comprehensive Kaggle tutorial notebooks

## Kaggle Notebooks

llcuda includes 13 comprehensive tutorial notebooks for Kaggle dual T4:

### Beginner (01-03)

- [01 - Quick Start](https://www.kaggle.com/code/waqasm86/01-quickstart-llcuda-v2-2-0): 5-minute introduction
- [02 - Server Setup](https://www.kaggle.com/code/waqasm86/02-llama-server-setup-llcuda-v2-2-0): Configuration guide
- [03 - Multi-GPU](https://www.kaggle.com/code/waqasm86/03-multi-gpu-inference-llcuda-v2-2-0): Dual T4 setup

### Intermediate (04-06)

- [04 - GGUF Quantization](https://www.kaggle.com/code/waqasm86/04-gguf-quantization-llcuda-v2-2-0): K-quants and I-quants
- [05 - Unsloth Integration](https://www.kaggle.com/code/waqasm86/05-unsloth-integration-llcuda-v2-2-0): Fine-tune workflow
- [06 - Split-GPU Graphistry](https://www.kaggle.com/code/waqasm86/06-split-gpu-graphistry-llcuda-v2-2-0): Dual GPU architecture

### Advanced (07-11)

- [07 - Knowledge Graph Extraction](https://www.kaggle.com/code/waqasm86/07-knowledge-graph-extraction-graphistry-v2-2-0): LLM-driven entity & relation graphs
- [08 - Document Network Analysis](https://www.kaggle.com/code/waqasm86/08-document-network-analysis-graphistry-llcuda-v2-2-0): GPU graph analytics for documents
- [09 - Large Models](https://www.kaggle.com/code/waqasm86/09-large-models-kaggle-llcuda-v2-2-0): 70B on dual T4
- [10 - Complete Workflow](https://www.kaggle.com/code/waqasm86/10-complete-workflow-llcuda-v2-2-0): End-to-end guide
- [11 - GGUF Neural Network Visualization](https://www.kaggle.com/code/waqasm86/11-gguf-neural-network-graphistry-vis-executed-2): Full architecture dashboards

## GGUF Quantization Support

llcuda v2.2.0 supports 29 GGUF quantization formats, including:

**K-Quants** (recommended):

- Q4_K_M, Q5_K_M, Q6_K, Q8_0

**I-Quants** (for 70B models):

- IQ3_XS, IQ3_XXS, IQ2_XXS

## Graphistry Integration

Visualize knowledge graphs extracted from LLM outputs (a sketch follows this list):

- GPU 0: Run the Unsloth GGUF LLM via llama-server
- GPU 1: RAPIDS cuDF/cuGraph + Graphistry
- Handle millions of nodes and edges
- Real-time graph visualization
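A minimal sketch of the GPU 1 side of this pipeline, assuming Graphistry Hub credentials and already-extracted edge triples; the triples and column names below are placeholder data, not llcuda output:

```python
import os

# Pin this process to GPU 1 before importing RAPIDS, leaving GPU 0 to llama-server.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import cudf
import graphistry

# Placeholder triples standing in for entity/relation pairs parsed
# from llama-server completions.
edges = cudf.DataFrame({
    "src": ["llcuda", "llama.cpp", "Unsloth"],
    "rel": ["wraps", "loads", "exports"],
    "dst": ["llama.cpp", "GGUF", "GGUF"],
})

# Graphistry Hub credentials (placeholders).
graphistry.register(api=3, username="...", password="...")

# Bind the GPU dataframe and open an interactive visualization.
graphistry.edges(edges, "src", "dst").plot()
```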