# Quick Start

Get started with llcuda v2.1.0 in 5 minutes!

## 5-Minute Quickstart

### Step 1: Install llcuda
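Install llcuda directly from GitHub (this is the same command shown in the Colab note below, without the leading `!`):

```bash
pip install -q git+https://github.com/waqasm86/llcuda.git
```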
**Google Colab Users:** Add `!` before the command: `!pip install -q git+https://github.com/waqasm86/llcuda.git`
### Step 2: Import and Verify

```python
import llcuda

# Check version
print(f"llcuda version: {llcuda.__version__}")
# Output: 2.1.0

# Verify GPU
compat = llcuda.check_gpu_compatibility()
print(f"GPU: {compat['gpu_name']}")
print(f"Compatible: {compat['compatible']}")
```
**First Import:** The first `import llcuda` downloads CUDA binaries (266 MB) from GitHub Releases. This takes 1-2 minutes; subsequent imports are instant.
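If you want to confirm the binaries were cached, here is a minimal check of the cache directory. The `~/.cache/llcuda` path is the one used by the manual-install commands in the Troubleshooting section; the internal layout of that directory is not assumed here.

```python
from pathlib import Path

# Binary cache location (see Troubleshooting > Binary Download Failed?)
cache_dir = Path.home() / ".cache" / "llcuda"

if cache_dir.is_dir() and any(cache_dir.iterdir()):
    print(f"CUDA binaries cached at {cache_dir}")
else:
    print("Cache is empty - binaries will be downloaded on first import")
```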
### Step 3: Load a Model

```python
# Create inference engine
engine = llcuda.InferenceEngine()

# Load Gemma 3-1B from Unsloth
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True  # Suppress llama-server output
)

print("✅ Model loaded!")
```
### Step 4: Run Inference

```python
# Ask a question
result = engine.infer(
    "Explain quantum computing in simple terms",
    max_tokens=200,
    temperature=0.7
)

# Print results
print(f"Response: {result.text}")
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")
print(f"Latency: {result.latency_ms:.0f}ms")
```
For typical speeds and latencies on a Tesla T4, see the Expected Performance table below.
## Complete Example

Here's a complete, copy-paste-ready example:

```python
import llcuda

# Initialize engine
engine = llcuda.InferenceEngine()

# Load model (downloads ~800 MB on first run)
print("Loading model...")
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Single inference
result = engine.infer(
    "What is machine learning?",
    max_tokens=150
)

print(f"Response: {result.text}")
print(f"Performance: {result.tokens_per_sec:.1f} tok/s")
```
## Common Use Cases

### Interactive Chat

```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

print("Chat with Gemma 3-1B (type 'exit' to quit)")

while True:
    user_input = input("\nYou: ")
    if user_input.lower() == "exit":
        break

    result = engine.infer(user_input, max_tokens=300)
    print(f"AI: {result.text}")
    print(f"({result.tokens_per_sec:.1f} tok/s)")
```
### Batch Processing

```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Multiple prompts
prompts = [
    "What is AI?",
    "Explain neural networks.",
    "Define deep learning."
]

# Process in batch
results = engine.batch_infer(prompts, max_tokens=80)

for prompt, result in zip(prompts, results):
    print(f"\nQ: {prompt}")
    print(f"A: {result.text}")
    print(f"Speed: {result.tokens_per_sec:.1f} tok/s")
```
## Try on Google Colab

You can also try llcuda in your browser with the interactive Colab notebook, which includes:
- ✅ Complete Tesla T4 setup guide
- ✅ GPU verification steps
- ✅ Binary download walkthrough
- ✅ Multiple inference examples
- ✅ Performance benchmarking
- ✅ Batch processing demo
## Expected Performance

On a Google Colab Tesla T4:
| Task | Speed | Latency |
|---|---|---|
| Simple query | 134 tok/s | ~690ms |
| Code generation | 136 tok/s | ~1.5s |
| Batch (4 prompts) | 135 tok/s avg | ~2.4s total |
These are verified real-world results! See the executed notebook for proof.
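If you want to measure throughput on your own hardware, here is a minimal sketch using the `infer` API shown above. The prompt list and token count are arbitrary choices for illustration, not the exact benchmark workload behind the table.

```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Arbitrary sample prompts - not the exact benchmark configuration from the table
prompts = ["What is AI?", "Explain neural networks.", "Define deep learning."]

speeds = []
for prompt in prompts:
    result = engine.infer(prompt, max_tokens=100)
    speeds.append(result.tokens_per_sec)
    print(f"{prompt!r}: {result.tokens_per_sec:.1f} tok/s, {result.latency_ms:.0f}ms")

print(f"Average: {sum(speeds) / len(speeds):.1f} tok/s")
```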
## Pro Tips

### Silent Mode

Suppress llama-server output for cleaner logs:

```python
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True  # ← Add this!
)
```
### Context Manager

Auto-cleanup resources:

```python
with llcuda.InferenceEngine() as engine:
    engine.load_model(
        "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
        silent=True
    )
    result = engine.infer("Test prompt", max_tokens=50)
    print(result.text)

# Server automatically stopped here
```
### Check GPU Before Loading

```python
# Verify GPU compatibility first
compat = llcuda.check_gpu_compatibility()

if not compat['compatible']:
    print(f"⚠️ GPU {compat['gpu_name']} may not be compatible")
    print("   llcuda is optimized for Tesla T4")
else:
    print(f"✅ {compat['gpu_name']} is compatible!")
    # Proceed with loading model...
```
## Troubleshooting

### Model Download Slow?

HuggingFace downloads can be slow, but the first download is cached:

```python
# First run: downloads ~800 MB (2-3 minutes)
engine.load_model("unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf")

# Subsequent runs: uses the cached model (instant)
engine.load_model("unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf")
```
### Out of Memory?

Try a smaller quantization or reduce the context size:

```python
# Smaller quantization
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q2_K.gguf",  # Q2_K instead of Q4_K_M
    silent=True
)

# Or reduce context size
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    context_size=2048,  # Default is 4096
    silent=True
)
```
### Binary Download Failed?

Manual installation:

```bash
wget https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
mkdir -p ~/.cache/llcuda
tar -xzf llcuda-binaries-cuda12-t4-v2.0.6.tar.gz -C ~/.cache/llcuda/
```
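To confirm the binaries landed in the right place (same path as the commands above):

```bash
# List the extracted CUDA binaries
ls -lh ~/.cache/llcuda/
```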
## Next Steps

- Complete walkthrough with Tesla T4 examples
- Full API documentation and advanced features
- Benchmarks and optimization tips
- Additional use cases and code samples

Questions? Check the FAQ or open an issue!