Fast LLM inference on Tesla T4 GPUs with FlashAttention and Tensor Core optimization. Built exclusively for Google Colab and Tesla T4 hardware with GitHub-only distribution.
=== "Tesla T4 Optimized"

Built specifically for Tesla T4 (SM 7.5) with:
- ✅ FlashAttention support (2-3x faster)
- ✅ Tensor Core optimization
- ✅ CUDA Graphs for reduced overhead
- ✅ **134 tokens/sec verified** on Gemma 3-1B
=== "GitHub-Only Distribution"

No PyPI dependency:
```bash
pip install git+https://github.com/waqasm86/llcuda.git
```
- Binaries auto-download from GitHub Releases (266 MB)
- One-time setup, cached for future use
- Direct from source, always up to date (a quick verification sketch follows below)
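After installation, a quick sanity check can confirm that the binaries downloaded and the GPU is recognized. A minimal sketch, using the `check_gpu_compatibility()` helper shown in the Google Colab example further down this page:

```python
# Post-install sanity check (a sketch; uses check_gpu_compatibility(),
# the same helper shown in the Google Colab example on this page).
import llcuda

compat = llcuda.check_gpu_compatibility()
print(f"GPU detected:  {compat['gpu_name']}")
print(f"T4-compatible: {compat['compatible']}")
```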
=== "Google Colab Ready"

Perfect for cloud notebooks:
- ✅ Free-tier Tesla T4 supported
- ✅ One-line install
- ✅ Instant inference
- ✅ Verified 134 tok/s performance
=== "Unsloth Integration"

Seamless workflow:
- Fine-tune with Unsloth (2x faster training)
- Export to GGUF format
- Deploy with llcuda (fast inference)
- Production-ready pipeline
Try llcuda on Google Colab right now!
```bash
# Install from GitHub
pip install git+https://github.com/waqasm86/llcuda.git
```

```python
# Run inference
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

result = engine.infer(
    "Explain quantum computing in simple terms",
    max_tokens=200
)

print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")
# Expected output: ~134 tokens/sec on Tesla T4
```
!!! success "First Run Downloads"
    CUDA binaries (266 MB) download automatically from GitHub Releases v2.0.6 on first import. Subsequent runs use the cached binaries, so startup is instant.
Real Google Colab Tesla T4 results; measured throughput came in roughly 3x above the initial 45 tok/s estimate:
| Model | Quantization | Speed | Latency | VRAM | Status |
|---|---|---|---|---|---|
| Gemma 3-1B | Q4_K_M | 134 tok/s | 690ms | 1.2 GB | ✅ Verified |
| Llama 3.2-3B | Q4_K_M | ~30 tok/s | - | 2.0 GB | Estimated |
| Qwen 2.5-7B | Q4_K_M | ~18 tok/s | - | 5.0 GB | Estimated |
| Llama 3.1-8B | Q4_K_M | ~15 tok/s | - | 5.5 GB | Estimated |
!!! tip "Performance Highlights"
    - 3x faster than the initial estimate (134 vs 45 tok/s)
    - Consistent 130-142 tok/s across batch inference
    - Full GPU offload (99 layers on T4)
    - FlashAttention + Tensor Cores delivering exceptional results
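To reproduce a rough throughput number on your own runtime, the sketch below averages a few runs. It assumes nothing beyond the `InferenceEngine`, `load_model`, `infer`, and `tokens_per_sec` API already shown on this page; the prompt and `max_tokens` values are arbitrary.

```python
# Rough throughput check on a Colab T4 (a sketch; only the llcuda API
# shown on this page is used, prompt and max_tokens are arbitrary).
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

speeds = []
for _ in range(5):
    result = engine.infer("Summarize the history of GPUs.", max_tokens=128)
    speeds.append(result.tokens_per_sec)

print(f"Mean throughput: {sum(speeds) / len(speeds):.1f} tok/s")
print(f"Range: {min(speeds):.1f}-{max(speeds):.1f} tok/s")
```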
:octicons-file-code-24: See Executed Notebook{ .md-button .md-button--primary }
=== "Interactive Chat"

```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"
)

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    result = engine.infer(user_input, max_tokens=400)
    print(f"Assistant: {result.text}")
```
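The loop above treats each turn independently. To let the model see earlier turns, one option is to fold the running transcript back into the prompt. This is a minimal sketch that reuses the `engine` from the snippet above and only its `infer()` call; llcuda may also offer a dedicated chat interface.

```python
# Carry conversation history by concatenating prior turns into the prompt
# (a sketch; reuses the engine from above and only its infer() call).
history = ""
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    prompt = f"{history}User: {user_input}\nAssistant:"
    result = engine.infer(prompt, max_tokens=400)
    print(f"Assistant: {result.text}")
    history = f"{prompt} {result.text}\n"
```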
=== "Batch Processing"

```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

prompts = [
    "What is machine learning?",
    "Explain neural networks briefly.",
    "Define deep learning concisely."
]

results = engine.batch_infer(prompts, max_tokens=80)

for prompt, result in zip(prompts, results):
    print(f"Q: {prompt}")
    print(f"A: {result.text}")
    print(f"Speed: {result.tokens_per_sec:.1f} tok/s\n")
```
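For a single number covering the whole batch rather than per-prompt figures, a short follow-up using only the `results` list from above:

```python
# Average per-prompt throughput across the batch (a sketch; assumes the
# `results` list and tokens_per_sec field from the snippet above).
avg = sum(r.tokens_per_sec for r in results) / len(results)
print(f"Average throughput across {len(results)} prompts: {avg:.1f} tok/s")
```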
=== "Google Colab"

```python
import llcuda

# Verify GPU compatibility
compat = llcuda.check_gpu_compatibility()
print(f"GPU: {compat['gpu_name']}")
print(f"Compatible: {compat['compatible']}")

# Load model
engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Run inference
result = engine.infer(
    "Explain artificial intelligence",
    max_tokens=300
)

print(result.text)
print(f"Performance: {result.tokens_per_sec:.1f} tok/s")
```
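If the compatibility check reports something other than a T4, confirm the Colab runtime type directly. A minimal cross-check, sketched with PyTorch (which is preinstalled on Colab) rather than anything llcuda-specific:

```python
# Confirm the Colab runtime actually exposes a Tesla T4 (SM 7.5).
# A sketch using PyTorch, preinstalled on Colab; not part of llcuda.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {name} (SM {major}.{minor})")
else:
    print("No CUDA GPU visible; switch the Colab runtime to a T4 GPU.")
```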
=== "Unsloth Workflow"

```python
# Step 1: Fine-tune with Unsloth
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-1b-it",
    max_seq_length=2048,
    load_in_4bit=True
)
# Train your model...

# Step 2: Export to GGUF
model.save_pretrained_gguf(
    "my_model",
    tokenizer,
    quantization_method="q4_k_m"
)

# Step 3: Deploy with llcuda
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("my_model/unsloth.Q4_K_M.gguf")

result = engine.infer("Your prompt", max_tokens=200)
print(result.text)
```
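Before Step 3, it can save a confusing load error to confirm that the export in Step 2 actually produced the GGUF file. A minimal sketch; the `my_model/unsloth.Q4_K_M.gguf` path simply mirrors the example above, and Unsloth's output naming can differ.

```python
# Check that the GGUF export from Step 2 exists before handing it to llcuda
# (a sketch; the path mirrors the example above and may differ in practice).
from pathlib import Path

gguf_path = Path("my_model/unsloth.Q4_K_M.gguf")
if gguf_path.exists():
    print(f"Found {gguf_path} ({gguf_path.stat().st_size / 1e6:.0f} MB)")
else:
    print("GGUF file not found; files Unsloth actually wrote:")
    for p in Path("my_model").glob("*.gguf"):
        print(" ", p)
```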
:octicons-arrow-right-24: Read Changelog{ .md-button }
MIT License - Free for commercial and personal use.