
# llcuda v2.0.6: Tesla T4 CUDA Inference


Fast LLM inference on Tesla T4 GPUs with FlashAttention and Tensor Core optimization. Built exclusively for Google Colab and Tesla T4 hardware with GitHub-only distribution.

## :rocket: Why llcuda v2.0.6?

=== "Tesla T4 Optimized"

    Built specifically for Tesla T4 (SM 7.5) with:

    - ✅ FlashAttention support (2-3x faster)
    - ✅ Tensor Core optimization
    - ✅ CUDA Graphs for reduced overhead
    - ✅ **134 tokens/sec verified** on Gemma 3-1B
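
    A quick way to confirm the runtime sees a compatible GPU before loading anything, using `check_gpu_compatibility()` from the API shown under Use Cases below:

    ```python
    import llcuda

    # Returns a dict with at least 'gpu_name' and 'compatible' keys
    compat = llcuda.check_gpu_compatibility()
    print(f"GPU: {compat['gpu_name']} (compatible: {compat['compatible']})")
    ```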

=== "GitHub-Only Distribution"

    No PyPI dependency:

    ```bash
    pip install git+https://github.com/waqasm86/llcuda.git
    ```

    - Binaries auto-download from GitHub Releases (266 MB)
    - One-time setup, cached for future use
    - Direct from source, always up-to-date
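
    To sanity-check the install from Python (the `__version__` attribute is assumed here as a common packaging convention; it is not documented on this page):

    ```python
    import llcuda  # first import triggers the one-time binary download

    # __version__ is an assumption, not a documented attribute
    print(getattr(llcuda, "__version__", "installed"))
    ```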

=== "Google Colab Ready"

    Perfect for cloud notebooks:

    - ✅ Free-tier Tesla T4 supported
    - ✅ One-line install
    - ✅ Instant inference
    - ✅ Verified 134 tok/s performance

=== "Unsloth Integration"

    A seamless, production-ready pipeline (full code in the "Unsloth Workflow" tab under Use Cases):

    1. Fine-tune with Unsloth (2x faster training)
    2. Export to GGUF format
    3. Deploy with llcuda for fast inference

## :fire: Quick Start

Try llcuda on Google Colab right now!


### 60-Second Setup

```bash
# Install from GitHub
pip install git+https://github.com/waqasm86/llcuda.git
```

```python
import llcuda

# Load a quantized Gemma 3-1B GGUF from HuggingFace (downloaded and cached)
engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Run inference
result = engine.infer(
    "Explain quantum computing in simple terms",
    max_tokens=200
)

print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")
# Expected output: ~134 tokens/sec on Tesla T4
```

!!! success "First Run Downloads"

    CUDA binaries (266 MB) download automatically from GitHub Releases v2.0.6 on first import. Subsequent runs use the cached binaries, so startup is instant.

## :chart_with_upwards_trend: Verified Performance

Results measured on a Google Colab Tesla T4. The verified Gemma 3-1B run came in roughly 3x faster than the initial 45 tok/s estimate; the remaining rows are estimates, not measurements:

| Model | Quantization | Speed | Latency | VRAM | Status |
|---|---|---|---|---|---|
| Gemma 3-1B | Q4_K_M | 134 tok/s | 690 ms | 1.2 GB | ✅ Verified |
| Llama 3.2-3B | Q4_K_M | ~30 tok/s | - | 2.0 GB | Estimated |
| Qwen 2.5-7B | Q4_K_M | ~18 tok/s | - | 5.0 GB | Estimated |
| Llama 3.1-8B | Q4_K_M | ~15 tok/s | - | 5.5 GB | Estimated |

!!! tip "Performance Highlights"

    - 3x faster than the initial estimate (134 vs. 45 tok/s)
    - Consistent 130-142 tok/s range across batch inference
    - Full GPU offload (99 layers on T4)
    - FlashAttention + Tensor Cores delivering exceptional results
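
To reproduce the verified number yourself, here is a minimal benchmarking sketch using only the API shown on this page; exact figures will vary with prompt, output length, and Colab load:

```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Repeat one prompt and collect the per-run decode speed
speeds = []
for _ in range(5):
    result = engine.infer("Explain quantum computing in simple terms", max_tokens=200)
    speeds.append(result.tokens_per_sec)

print(f"min/avg/max: {min(speeds):.1f} / {sum(speeds)/len(speeds):.1f} / {max(speeds):.1f} tok/s")
```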

:octicons-file-code-24: See Executed Notebook{ .md-button .md-button--primary }

## :sparkles: Features

- :material-rocket-launch: **Auto-Download**: fetch CUDA binaries and GGUF models automatically
    - GitHub Releases integration
    - HuggingFace model support
    - Smart caching system
- :material-speedometer: **Optimized for T4**: built specifically for Tesla T4 GPUs
    - SM 7.5 targeting
    - FlashAttention enabled
    - Tensor Core support
- :material-chat: **Easy API**: PyTorch-style inference interface
    - Single-line model loading
    - Batch processing
    - Streaming support
- :material-cloud-upload: **Production Ready**: reliable and well-tested
    - Comprehensive error handling
    - Silent mode for servers
    - MIT licensed
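
As a concrete shape for the "Production Ready" and "Easy API" points, a hedged server-style sketch using only calls documented on this page (llcuda's specific exception types are not documented here, so a broad `except` is used):

```python
import llcuda

engine = llcuda.InferenceEngine()
try:
    # silent=True suppresses progress output, useful for servers and scripts
    engine.load_model(
        "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
        silent=True
    )
except Exception as exc:  # specific llcuda exceptions are not documented above
    raise SystemExit(f"Model load failed: {exc}")

# Cheap warm-up / health-check request
result = engine.infer("ping", max_tokens=8)
print(result.text)
```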

## :books: Use Cases

=== "Interactive Chat"

    ```python
    import llcuda

    engine = llcuda.InferenceEngine()
    engine.load_model(
        "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"
    )

    # Single-turn loop: each prompt is answered independently (no chat history)
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break

        result = engine.infer(user_input, max_tokens=400)
        print(f"Assistant: {result.text}")
    ```

=== "Batch Processing"

    ```python
    import llcuda

    engine = llcuda.InferenceEngine()
    engine.load_model(
        "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
        silent=True
    )

    prompts = [
        "What is machine learning?",
        "Explain neural networks briefly.",
        "Define deep learning concisely."
    ]

    results = engine.batch_infer(prompts, max_tokens=80)

    for prompt, result in zip(prompts, results):
        print(f"Q: {prompt}")
        print(f"A: {result.text}")
        print(f"Speed: {result.tokens_per_sec:.1f} tok/s\n")
    ```
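
    To report aggregate throughput for the whole batch, the per-result speeds can be averaged (a small extension of the code above, using only the documented `tokens_per_sec` field):

    ```python
    avg = sum(r.tokens_per_sec for r in results) / len(results)
    print(f"Average across batch: {avg:.1f} tok/s")
    ```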

=== "Google Colab"

    ```python
    import llcuda

    # Verify GPU compatibility
    compat = llcuda.check_gpu_compatibility()
    print(f"GPU: {compat['gpu_name']}")
    print(f"Compatible: {compat['compatible']}")

    # Load model
    engine = llcuda.InferenceEngine()
    engine.load_model(
        "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
        silent=True
    )

    # Run inference
    result = engine.infer(
        "Explain artificial intelligence",
        max_tokens=300
    )
    print(result.text)
    print(f"Performance: {result.tokens_per_sec:.1f} tok/s")
    ```
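
    On a non-T4 runtime you may want to fail fast before loading the model; a minimal guard using only the keys shown above:

    ```python
    if not compat["compatible"]:
        raise RuntimeError(f"Unsupported GPU: {compat['gpu_name']} (llcuda targets Tesla T4)")
    ```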

=== "Unsloth Workflow"

    ```python
    # Step 1: Fine-tune with Unsloth
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        "unsloth/gemma-3-1b-it",
        max_seq_length=2048,
        load_in_4bit=True
    )

    # Train your model...

    # Step 2: Export to GGUF
    model.save_pretrained_gguf(
        "my_model",
        tokenizer,
        quantization_method="q4_k_m"
    )

    # Step 3: Deploy with llcuda
    import llcuda

    engine = llcuda.InferenceEngine()
    engine.load_model("my_model/unsloth.Q4_K_M.gguf")

    result = engine.infer("Your prompt", max_tokens=200)
    print(result.text)
    ```

## :books: Next Steps

- [:material-rocket-launch: **Quick Start Guide**](/guides/quickstart.html): get started in 5 minutes with step-by-step instructions
- [:material-download: **Installation**](/guides/installation.html): detailed installation for Google Colab and local systems
- [:material-google: **Google Colab Tutorial**](/tutorials/gemma-3-1b-colab.html): complete walkthrough with Tesla T4 GPU examples
- [:material-code-braces: **API Reference**](/api/overview.html): full API documentation and advanced usage
- [:material-chart-line: **Performance Benchmarks**](performance/benchmarks.md): detailed benchmarks and optimization tips
- [:material-notebook: **Jupyter Notebooks**](notebooks/index.md): ready-to-run Colab notebooks with examples

## :sparkles: What’s New in v2.0.6

:octicons-arrow-right-24: Read Changelog{ .md-button }

## :handshake: Community & Support

## :balance_scale: License

MIT License - Free for commercial and personal use.


Built with ❤️ by Waqas Muhammad | Powered by llama.cpp | Optimized for Unsloth