Installation Guide¶
Install llcuda v2.1.0 directly from GitHub - No PyPI needed!
Quick Install¶
Method 1: Direct from GitHub (Recommended)¶
pip install git+https://github.com/waqasm86/llcuda.git
This single command will:
- ✅ Clone the latest code from GitHub
- ✅ Install the Python package
- ✅ Auto-download CUDA binaries (266 MB) from GitHub Releases on first import
Recommended for most users
This is the easiest method and works on Google Colab, Kaggle, and local Linux systems.
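After installing, a quick import confirms the package and shows where the binaries land. This is a convenience check, not a required step; `_BIN_DIR` is the same attribute the manual-installation section below prints.

```python
# First import triggers the one-time 266 MB binary download (1-2 minutes).
import llcuda

print(llcuda.__version__)   # Expected: 2.1.0
print(llcuda._BIN_DIR)      # Directory the CUDA binaries were unpacked into
```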
Method 2: Install from Specific Release¶
pip install https://github.com/waqasm86/llcuda/releases/download/v2.1.0/llcuda-2.1.0-py3-none-any.whl
Method 3: Install from Source (Development)¶
# 1. Clone the repository
git clone https://github.com/waqasm86/llcuda.git
cd llcuda
# 2. Editable install for development
pip install -e .
What Gets Installed¶
Python Package¶
- Source: GitHub repository (main branch or release tag)
- Size: ~100 KB (Python code only, no binaries)
- Contents: Core Python package, API, bootstrap code
CUDA Binaries (Auto-Downloaded)¶
- Source: GitHub Releases v2.0.6
- URL: https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
- Size: 266 MB (one-time download, cached locally)
- Triggered: On first import llcuda
- Location: ~/.cache/llcuda/ or <package>/binaries/
- Compatibility: v2.0.6 binaries work with llcuda v2.1.0+
Binary Package Contents:
llcuda-binaries-cuda12-t4-v2.0.6.tar.gz (266 MB)
├── bin/
│   ├── llama-server (6.5 MB) - Inference server
│   ├── llama-cli (4.2 MB) - Command-line interface
│   ├── llama-embedding (3.3 MB) - Embedding generator
│   ├── llama-bench (581 KB) - Benchmarking tool
│   └── llama-quantize (434 KB) - Model quantization
└── lib/
    ├── libggml-cuda.so (221 MB) - CUDA kernels + FlashAttention
    ├── libllama.so (2.9 MB) - Llama core library
    └── Other libraries...
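To know in advance whether an import will trigger the download, you can look for the default cache directory listed above. A minimal sketch (the path is the documented default; the size calculation is just a convenience):

```python
from pathlib import Path

cache_dir = Path.home() / ".cache" / "llcuda"   # default cache location listed above
if cache_dir.exists():
    size_mb = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file()) / 1e6
    print(f"Cached binaries found ({size_mb:.0f} MB); import will skip the download")
else:
    print("No cache yet; the first 'import llcuda' will download ~266 MB")
```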
Platform-Specific Instructions¶
Google Colab (Tesla T4)¶
Perfect for cloud notebooks!
# 1. Install
!pip install -q git+https://github.com/waqasm86/llcuda.git
# 2. Import (triggers binary download on first run)
import llcuda
# 3. Verify GPU
compat = llcuda.check_gpu_compatibility()
print(f"GPU: {compat['gpu_name']}") # Should show: Tesla T4
print(f"Compatible: {compat['compatible']}") # Should show: True
# 4. Ready to use!
engine = llcuda.InferenceEngine()
First Run
The first import downloads 266 MB of binaries and takes 1-2 minutes. Later imports reuse the cached binaries and start almost instantly. Colab runtimes are ephemeral, though, so a fresh VM downloads the binaries again; see the sketch below for one way to persist the cache.
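One workaround for Colab's ephemeral disk is to keep a copy of the cache on Google Drive and restore it before importing. This is a convenience sketch, not an llcuda feature; the Drive folder name is arbitrary.

```python
# Optional: persist the binary cache across Colab sessions via Google Drive.
import shutil
from pathlib import Path
from google.colab import drive

drive.mount("/content/drive")
backup = Path("/content/drive/MyDrive/llcuda_cache")   # any Drive folder you like
cache = Path.home() / ".cache" / "llcuda"

if backup.exists() and not cache.exists():
    shutil.copytree(backup, cache)    # restore before `import llcuda`
elif cache.exists() and not backup.exists():
    shutil.copytree(cache, backup)    # save after the first download
```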
Local Linux (Ubuntu/Debian)¶
Requirements:
- Python 3.11+
- CUDA 12.x runtime
- Tesla T4 GPU or compatible
# 1. Ensure CUDA 12 is installed
nvidia-smi # Should show CUDA 12.x
# 2. Install llcuda
pip install git+https://github.com/waqasm86/llcuda.git
# 3. Test installation
python3 -c "import llcuda; print(llcuda.__version__)"
# Output: 2.1.0
System Dependencies (usually pre-installed):
Kaggle Notebooks¶
# 1. Enable GPU accelerator
# Settings → Accelerator → GPU T4 x2
# 2. Install
!pip install -q git+https://github.com/waqasm86/llcuda.git
# 3. Import and verify
import llcuda
compat = llcuda.check_gpu_compatibility()
print(f"GPU: {compat['gpu_name']}")
# 4. Start using
engine = llcuda.InferenceEngine()
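If you are unsure whether the accelerator was actually enabled, listing the GPUs is a quick sanity check; on the T4 x2 runtime you should see two Tesla T4 entries. This is a convenience check, not an llcuda requirement.

```python
import subprocess

# `nvidia-smi -L` lists the visible GPUs; expect two "Tesla T4" lines on T4 x2.
gpus = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
print(gpus)
```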
Windows with WSL2¶
Prerequisites:
- Windows 11 with WSL2
- NVIDIA GPU with CUDA support
- CUDA 12.x installed in WSL2
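Before installing, confirm that the NVIDIA driver is actually exposed inside WSL2. A minimal check, assuming the Windows host driver with WSL support is installed:

```python
import shutil
import subprocess

if shutil.which("nvidia-smi") is None:
    print("nvidia-smi not found inside WSL2 - install the NVIDIA driver with WSL support")
else:
    # Should report a CUDA 12.x driver and your GPU
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```

From there, installation follows the same pip command as the Local Linux section above.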
Verification¶
After installation, verify everything works:
import llcuda
# 1. Check version
print(f"llcuda version: {llcuda.__version__}")
# Expected: 2.1.0
# 2. Check GPU compatibility
compat = llcuda.check_gpu_compatibility()
print(f"GPU: {compat['gpu_name']}")
print(f"Compute Capability: SM {compat['compute_capability']}")
print(f"Platform: {compat['platform']}")
print(f"Compatible: {compat['compatible']}")
# 3. Quick inference test
engine = llcuda.InferenceEngine()
engine.load_model(
"unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
silent=True
)
result = engine.infer("What is 2+2?", max_tokens=20)
print(f"Response: {result.text}")
print(f"Speed: {result.tokens_per_sec:.1f} tok/s")
# Expected on T4: ~134 tok/s
Manual Binary Installation (Advanced)¶
If automatic download fails, install binaries manually:
# 1. Download binary package
wget https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
# 2. Verify checksum
echo "5a27d2e1a73ae3d2f1d2ba8cf557b76f54200208c8df269b1bd0d9ee176bb49d llcuda-binaries-cuda12-t4-v2.0.6.tar.gz" | sha256sum -c
# 3. Extract to cache directory
mkdir -p ~/.cache/llcuda
tar -xzf llcuda-binaries-cuda12-t4-v2.0.6.tar.gz -C ~/.cache/llcuda/
# 4. Or extract to package directory
python3 -c "import llcuda; print(llcuda._BIN_DIR)"
# Extract to the printed directory
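If wget is unavailable, the same download, checksum verification, and extraction can be done from Python. The URL and SHA-256 are the ones shown above; the target directory is the default cache location.

```python
import hashlib
import tarfile
import urllib.request
from pathlib import Path

URL = ("https://github.com/waqasm86/llcuda/releases/download/"
       "v2.0.6/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz")
SHA256 = "5a27d2e1a73ae3d2f1d2ba8cf557b76f54200208c8df269b1bd0d9ee176bb49d"

cache = Path.home() / ".cache" / "llcuda"
cache.mkdir(parents=True, exist_ok=True)
archive = cache / "llcuda-binaries-cuda12-t4-v2.0.6.tar.gz"

urllib.request.urlretrieve(URL, archive)                    # 1. download (266 MB)
digest = hashlib.sha256(archive.read_bytes()).hexdigest()   # 2. verify checksum
assert digest == SHA256, f"Checksum mismatch: {digest}"
with tarfile.open(archive, "r:gz") as tar:                  # 3. extract into the cache
    tar.extractall(cache)
print(f"Binaries extracted to {cache}")
```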
Testing Your Installation¶
Basic Test¶
import llcuda
print(f"llcuda version: {llcuda.__version__}")
# Expected: 2.1.0
GPU Test¶
compat = llcuda.check_gpu_compatibility()
assert compat['compatible'], "GPU not compatible!"
print(f"✅ GPU compatible: {compat['gpu_name']}")
Inference Test¶
engine = llcuda.InferenceEngine()
engine.load_model(
"unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
silent=True
)
result = engine.infer("Hello!", max_tokens=10)
assert result.tokens_generated > 0
print(f"✅ Inference working: {result.tokens_per_sec:.1f} tok/s")
Requirements¶
System Requirements¶
- Python: 3.11 or higher
- CUDA: 12.x runtime
- GPU: Tesla T4 (SM 7.5) - Primary target
- RAM: 4 GB minimum
- Disk: 1 GB free space (for binaries and models)
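A quick way to check the Python and disk requirements from inside the target environment (a convenience sketch; the GPU check is handled by check_gpu_compatibility() above):

```python
import shutil
import sys

# Check the Python and disk-space requirements listed above.
assert sys.version_info >= (3, 11), "llcuda requires Python 3.11 or higher"
free_gb = shutil.disk_usage(".").free / 1e9
assert free_gb >= 1, f"Need at least 1 GB free, found {free_gb:.1f} GB"
print(f"Python {sys.version.split()[0]} and {free_gb:.1f} GB free disk: OK")
```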
Python Dependencies¶
Automatically installed by pip:
Optional for development:
Upgrading¶
Upgrade to Latest Version¶
pip install --upgrade git+https://github.com/waqasm86/llcuda.git
Force Reinstall¶
pip install --force-reinstall git+https://github.com/waqasm86/llcuda.git
Clear Cache and Reinstall¶
# Remove cached binaries
rm -rf ~/.cache/llcuda/
# Reinstall
pip uninstall llcuda -y
pip install git+https://github.com/waqasm86/llcuda.git
Uninstallation¶
# Remove Python package
pip uninstall llcuda -y
# Remove cached binaries
rm -rf ~/.cache/llcuda/
# Clear pip's download cache
pip cache purge
Troubleshooting¶
Binary Download Fails¶
Error: RuntimeError: Binary download failed
Solution:
# Check internet connection
import requests
response = requests.get("https://github.com")
print(response.status_code) # Should be 200
# Try manual download (see Manual Binary Installation above)
Import Error¶
Error: ModuleNotFoundError: No module named 'llcuda'
Solution:
# Verify installation
pip list | grep llcuda
# Reinstall
pip install --force-reinstall git+https://github.com/waqasm86/llcuda.git
GPU Not Detected¶
Error: RuntimeError: No CUDA GPUs detected
Solution:
# Verify CUDA is working
nvidia-smi
# Check GPU visibility
python3 -c "import subprocess; print(subprocess.run(['nvidia-smi'], capture_output=True).stdout)"
Next Steps¶
- Quick Start Guide - Get started in 5 minutes
- Google Colab Tutorial - Complete walkthrough
- Troubleshooting - Common issues and solutions
- API Reference - Detailed API documentation
Need help? Open an issue on GitHub