# llcuda - CUDA 12 Inference Backend for Unsloth

> llcuda v2.2.0: CUDA 12 inference backend for Unsloth with multi-GPU support on Kaggle dual Tesla T4 (30 GB VRAM). Split-GPU architecture: GPU 0 for LLM inference, GPU 1 for Graphistry visualization. Optimized for small GGUF models (1B–5B) on Kaggle dual T4.

## About llcuda

llcuda is a CUDA 12 inference backend designed specifically for Unsloth GGUF models on Kaggle's dual Tesla T4 GPU environment. It implements a split-GPU architecture: GPU 0 handles LLM inference through llama-server, while GPU 1 runs RAPIDS/Graphistry for knowledge-graph visualization with millions of nodes and edges.

Key Features:

- 🚀 **Dual T4 GPU Support**: 2× Tesla T4 (15 GB each, 30 GB total VRAM)
- 🔥 **Split-GPU Architecture**: LLM on GPU 0, Graphistry on GPU 1
- ⚡ **Native CUDA tensor-split**: llama.cpp layer distribution (NOT NCCL)
- 🎯 **961 MB Binary Package**: llama.cpp build 7760 with FlashAttention
- 🔧 **70B Model Support**: Run Llama-70B IQ3_XS on dual T4
- 📦 **29 GGUF Quantization Formats**: K-quants and I-quants
- 🚀 **Kaggle Optimized**: 13 tutorial notebooks for dual T4
- 🔄 **Unsloth Integration**: Fine-tune → GGUF export → llcuda deployment
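Conceptually, the split-GPU layout described above is process-level GPU pinning. The sketch below is a minimal illustration of that technique, not llcuda's actual API: it assumes a `llama-server` binary on the PATH and a local GGUF file (both placeholders), and uses `CUDA_VISIBLE_DEVICES` to confine each workload to one T4.

```python
import os
import subprocess

# GPU 0: llama-server sees only the first T4. Model path and port are
# placeholders; llcuda's ServerManager normally handles this launch.
llm = subprocess.Popen(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
)

# GPU 1: pin the current process to the second T4 *before* importing
# RAPIDS, so cuDF/cuGraph/Graphistry allocations land there.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import cudf  # noqa: E402 -- deliberate post-pinning import
```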
## Getting Started

- [Homepage](https://llcuda.github.io/): Official llcuda v2.2.0 documentation
- [Installation Guide](https://llcuda.github.io/guides/installation/): Complete setup for Kaggle
- [Quick Start Guide](https://llcuda.github.io/guides/quickstart/): 5-minute guide
- [GitHub Repository](https://github.com/llcuda/llcuda): Main llcuda project

## Comprehensive Tutorials (Kaggle Notebooks)

- [01 - Quick Start](https://llcuda.github.io/tutorials/01-quickstart/): 5-minute introduction to llcuda v2.2.0
- [02 - Server Setup](https://llcuda.github.io/tutorials/02-server-setup/): llama-server configuration and lifecycle
- [03 - Multi-GPU Inference](https://llcuda.github.io/tutorials/03-multi-gpu/): Dual T4 tensor-split configuration
- [04 - GGUF Quantization](https://llcuda.github.io/tutorials/04-gguf-quantization/): K-quants and I-quants explained
- [05 - Unsloth Integration](https://llcuda.github.io/tutorials/05-unsloth-integration/): Fine-tune to deployment workflow
- [06 - Split-GPU Graphistry](https://llcuda.github.io/tutorials/06-split-gpu-graphistry/): LLM + Graphistry on separate GPUs
- [07 - Knowledge Graph Extraction](https://llcuda.github.io/tutorials/07-openai-api/): LLM-driven entity & relation graphs
- [08 - Document Network Analysis](https://llcuda.github.io/tutorials/08-nccl-pytorch/): GPU graph analytics for documents
- [09 - Large Models](https://llcuda.github.io/tutorials/09-large-models/): Run 70B models on dual T4
- [10 - Complete Workflow](https://llcuda.github.io/tutorials/10-complete-workflow/): End-to-end production workflow
- [11 - GGUF Neural Network Visualization](https://llcuda.github.io/tutorials/11-gguf-neural-network-visualization/): Full architecture dashboards

## Main Repository

- [llcuda GitHub Repository](https://github.com/llcuda/llcuda): Main llcuda v2.2.0 source code
- [Binary Releases](https://github.com/llcuda/llcuda/releases/tag/v2.2.0): 961 MB CUDA 12 binaries
- [Kaggle Notebooks](https://github.com/llcuda/llcuda/tree/main/notebooks): 13 tutorial notebooks

## Complete Documentation

### Getting Started Guides

- [Quick Start Guide](https://llcuda.github.io/guides/quickstart/): 5-minute setup
- [Installation Guide](https://llcuda.github.io/guides/installation/): Kaggle setup instructions
- [Kaggle Setup](https://llcuda.github.io/guides/kaggle-setup/): Dual T4 configuration
- [First Steps](https://llcuda.github.io/guides/first-steps/): GPU verification and basic usage
- [Troubleshooting](https://llcuda.github.io/guides/troubleshooting/): Common issues and solutions
- [FAQ](https://llcuda.github.io/guides/faq/): Frequently asked questions

### Advanced Guides

- [Model Selection Guide](https://llcuda.github.io/guides/model-selection/): Choose models for dual T4
- [Build from Source](https://llcuda.github.io/guides/build-from-source/): Compile llama.cpp binaries

### API Reference

- [API Overview](https://llcuda.github.io/api/overview/): Complete API documentation
- [ServerManager API](https://llcuda.github.io/api/server/): Server lifecycle management
- [MultiGPU API](https://llcuda.github.io/api/multigpu/): Dual GPU configuration
- [LlamaCppClient API](https://llcuda.github.io/api/client/): OpenAI-compatible client
- [GGUF Tools API](https://llcuda.github.io/api/gguf/): GGUF parsing and utilities
- [Code Examples](https://llcuda.github.io/api/examples/): Working examples

### Kaggle Documentation

- [Dual GPU Setup](https://llcuda.github.io/kaggle/dual-gpu-setup/): Configure dual T4
- [Multi-GPU Inference](https://llcuda.github.io/kaggle/multi-gpu-inference/): Tensor-split usage
- [Tensor-Split Guide](https://llcuda.github.io/kaggle/tensor-split/): Layer distribution
- [Large Models](https://llcuda.github.io/kaggle/large-models/): 70B on dual T4

### Performance Documentation

- [Performance Benchmarks](https://llcuda.github.io/performance/benchmarks/): Dual T4 results
- [Dual T4 Results](https://llcuda.github.io/performance/dual-t4-results/): Verified performance
- [FlashAttention](https://llcuda.github.io/performance/flash-attention/): 2-3x speedup
- [Memory Management](https://llcuda.github.io/performance/memory/): VRAM optimization
- [Optimization Guide](https://llcuda.github.io/performance/optimization/): Advanced tuning

## Installation

Install llcuda v2.2.0 on Kaggle:

```bash
pip install llcuda
```

Binary auto-download: the 961 MB CUDA binaries download automatically on first use.
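Once the binaries are in place and a llama-server instance is running, requests go through the server's OpenAI-compatible HTTP API. A minimal sketch, assuming a server already listening on localhost:8080; the model name and prompt are placeholders, and llcuda's own LlamaCppClient (documented above) is the integrated alternative:

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint; no real API key is
# needed locally, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

resp = client.chat.completions.create(
    model="gemma-2-2b-q4_k_m",  # placeholder; the server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Summarize what GGUF quantization is."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```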
## Key Features

- **Dual T4 GPU Support**: 30 GB total VRAM (15 GB × 2)
- **Split-GPU Architecture**: LLM on GPU 0, Graphistry on GPU 1
- **Native tensor-split**: llama.cpp layer distribution
- **70B Model Support**: Run Llama-70B IQ3_XS on 30 GB VRAM
- **29 GGUF Quantization Formats**: K-quants and I-quants
- **FlashAttention**: 2-3x speedup for all quantization types
- **Kaggle Optimized**: 13 comprehensive tutorial notebooks

## Performance Data

Verified Kaggle dual T4 performance:

- **Gemma 2-2B Q4_K_M**: ~60 tokens/sec
- **Llama-3.2-3B Q4_K_M**: ~45 tokens/sec
- **Qwen-2.5-7B Q4_K_M**: ~25 tokens/sec
- **Llama-70B IQ3_XS**: ~12 tokens/sec (dual T4)

## Split-GPU Architecture

The llcuda v2.2.0 split-GPU architecture:

```
GPU 0: llama-server (Unsloth GGUF LLM)
        ↓ Extract knowledge graphs from LLM outputs
        ↓
GPU 1: RAPIDS cuDF/cuGraph + Graphistry
        → Visualize millions of nodes/edges
```
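For the large-model path (e.g. Llama-70B IQ3_XS), the alternative to this layout is to give llama-server both T4s and spread layers across them with llama.cpp's native tensor-split, with no NCCL involved. A minimal launch sketch; the model path and the 50/50 split ratio are illustrative, and the MultiGPU API documented above covers the equivalent llcuda-side configuration:

```python
import subprocess

# Distribute layers across both T4s via llama.cpp's native tensor-split.
# Flags are standard llama-server options; exact spellings can vary by build.
subprocess.Popen([
    "llama-server",
    "-m", "Llama-70B-IQ3_XS.gguf",   # placeholder path to the GGUF file
    "--n-gpu-layers", "99",          # offload every layer to the GPUs
    "--tensor-split", "0.5,0.5",     # half the layers on each T4
    "--flash-attn",                  # FlashAttention (flag form varies by build)
    "--port", "8080",
])
```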
## Documentation Structure

- [Home](https://llcuda.github.io/): Main homepage
- [Quick Start](https://llcuda.github.io/guides/quickstart/): 5-minute setup
- [Installation](https://llcuda.github.io/guides/installation/): Kaggle installation
- [Tutorial Index](https://llcuda.github.io/tutorials/index/): 13 Kaggle notebooks
- [API Reference](https://llcuda.github.io/api/overview/): Complete API docs

## Key Information

**llcuda v2.2.0** is a CUDA 12 inference backend for Unsloth:

- **Binary Package**: 961 MB with llama.cpp build 7760
- **Target Platform**: Kaggle dual Tesla T4 (30 GB VRAM)
- **Verified Performance**: 60 tok/s on Gemma 2-2B, 12 tok/s on Llama-70B
- **CUDA Version**: CUDA 12.x
- **Optimizations**: FlashAttention, native tensor-split
- **Format**: GGUF models (29 quantization types)

Remember:

- llcuda v2.2.0 targets Kaggle dual Tesla T4 GPUs
- Install via `pip install llcuda`
- Binaries auto-download on first use (961 MB)
- Split-GPU: LLM on GPU 0, Graphistry on GPU 1
- 70B models are supported with IQ3_XS quantization
- 13 comprehensive Kaggle tutorial notebooks

## Kaggle Notebooks

llcuda includes 13 comprehensive tutorial notebooks for Kaggle dual T4:

### Beginner (01-03)

- [01 - Quick Start](https://www.kaggle.com/code/waqasm86/01-quickstart-llcuda-v2-2-0): 5-minute introduction
- [02 - Server Setup](https://www.kaggle.com/code/waqasm86/02-llama-server-setup-llcuda-v2-2-0): Configuration guide
- [03 - Multi-GPU](https://www.kaggle.com/code/waqasm86/03-multi-gpu-inference-llcuda-v2-2-0): Dual T4 setup

### Intermediate (04-06)

- [04 - GGUF Quantization](https://www.kaggle.com/code/waqasm86/04-gguf-quantization-llcuda-v2-2-0): K-quants and I-quants
- [05 - Unsloth Integration](https://www.kaggle.com/code/waqasm86/05-unsloth-integration-llcuda-v2-2-0): Fine-tune workflow
- [06 - Split-GPU Graphistry](https://www.kaggle.com/code/waqasm86/06-split-gpu-graphistry-llcuda-v2-2-0): Dual GPU architecture

### Advanced (07-11)

- [07 - Knowledge Graph Extraction](https://www.kaggle.com/code/waqasm86/07-knowledge-graph-extraction-graphistry-v2-2-0): LLM-driven entity & relation graphs
- [08 - Document Network Analysis](https://www.kaggle.com/code/waqasm86/08-document-network-analysis-graphistry-llcuda-v2-2-0): GPU graph analytics for documents
- [09 - Large Models](https://www.kaggle.com/code/waqasm86/09-large-models-kaggle-llcuda-v2-2-0): 70B on dual T4
- [10 - Complete Workflow](https://www.kaggle.com/code/waqasm86/10-complete-workflow-llcuda-v2-2-0): End-to-end guide
- [11 - GGUF Neural Network Visualization](https://www.kaggle.com/code/waqasm86/11-gguf-neural-network-graphistry-vis-executed-2): Full architecture dashboards

## GGUF Quantization Support

llcuda v2.2.0 supports 29 GGUF quantization formats, including:

**K-Quants** (recommended):

- Q4_K_M, Q5_K_M, Q6_K, Q8_0

**I-Quants** (for 70B models):

- IQ3_XS, IQ3_XXS, IQ2_XXS

## Graphistry Integration

Visualize knowledge graphs extracted from LLM outputs (a sketch follows this list):

- GPU 0: Run the Unsloth GGUF LLM via llama-server
- GPU 1: RAPIDS cuDF/cuGraph + Graphistry
- Handle millions of nodes and edges
- Real-time graph visualization
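A minimal sketch of the GPU 1 side of this pipeline, assuming Graphistry Hub credentials and already-extracted edge triples; the triples and column names below are placeholder data, not llcuda output:

```python
import os

# Pin this process to GPU 1 before importing RAPIDS, leaving GPU 0 to llama-server.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import cudf
import graphistry

# Placeholder triples standing in for entity/relation pairs parsed
# from llama-server completions.
edges = cudf.DataFrame({
    "src": ["llcuda", "llama.cpp", "Unsloth"],
    "rel": ["wraps", "loads", "exports"],
    "dst": ["llama.cpp", "GGUF", "GGUF"],
})

# Graphistry Hub credentials (placeholders).
graphistry.register(api=3, username="...", password="...")

# Bind the GPU dataframe and open an interactive visualization.
graphistry.edges(edges, "src", "dst").plot()
```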