Troubleshooting Guide¶
Solutions to common issues with llcuda v2.1.0 on Tesla T4 GPUs.
Installation Issues¶
pip install fails¶
Symptom: pip install llcuda fails or installs an older release (v2.1.0 is not published on PyPI).
Solution:
# Install from GitHub (not PyPI for v2.1.0)
pip install git+https://github.com/waqasm86/llcuda.git
# Or use specific release
pip install https://github.com/waqasm86/llcuda/releases/download/v2.1.0/llcuda-2.1.0-py3-none-any.whl
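After installing, confirm the expected version is active (llcuda.__version__ is also what you would include in a bug report):

import llcuda

print(llcuda.__version__)  # Should print 2.1.0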
Binary download fails¶
Symptom: llcuda cannot download its prebuilt CUDA binaries on first use (for example, a network error or a "Binaries not found" message).
Solution:
# Manually download binaries
import requests
import tarfile
from pathlib import Path
url = "https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz"
cache_dir = Path.home() / ".cache" / "llcuda"
cache_dir.mkdir(parents=True, exist_ok=True)
# Download the release archive
response = requests.get(url)
response.raise_for_status()
tar_path = cache_dir / "binaries.tar.gz"
tar_path.write_bytes(response.content)

# Extract into the llcuda cache directory
with tarfile.open(tar_path, 'r:gz') as tar:
    tar.extractall(cache_dir)
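To sanity-check the extraction, list what ended up in the cache directory (a minimal check; the exact archive layout is not documented here):

from pathlib import Path

cache_dir = Path.home() / ".cache" / "llcuda"
for path in sorted(cache_dir.rglob("*")):
    print(path.relative_to(cache_dir))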
GPU Issues¶
GPU not detected¶
Symptom: llcuda reports that no CUDA-capable GPU is available, or nvidia-smi fails to run.
Solution:
# Check NVIDIA driver
nvidia-smi
# If fails in Colab, verify runtime type
# Runtime > Change runtime type > GPU > T4
# Verify CUDA version
nvcc --version # Should show CUDA 12.x
Wrong GPU detected¶
Symptom: llcuda reports that the detected GPU is not a supported Tesla T4.
Solution: llcuda v2.1.0 is Tesla T4-only. For other GPUs, use v1.2.2, as sketched below.
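A minimal sketch of the downgrade, assuming v1.2.2 is published on PyPI and that the repository has a v1.2.2 tag; adjust to however you normally install llcuda:

# Install the multi-GPU release (assumes v1.2.2 is on PyPI)
pip install llcuda==1.2.2

# Or install the tagged release from GitHub (assumes a v1.2.2 tag exists)
pip install git+https://github.com/waqasm86/llcuda.git@v1.2.2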
Model Loading Issues¶
Model not found¶
Symptom: load_model fails because the model file or HuggingFace repository cannot be found.
Solution:
# Use full HuggingFace path
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"
)

# Or download manually
from llcuda.models import download_model

model_path = download_model(
    "unsloth/gemma-3-1b-it-GGUF",
    "gemma-3-1b-it-Q4_K_M.gguf"
)
Out of memory¶
Symptom: model loading or inference fails with a CUDA out-of-memory error.
Solution:
# Reduce GPU layers
engine.load_model("model.gguf", gpu_layers=20)
# Reduce context size
engine.load_model("model.gguf", ctx_size=1024)
# Use smaller quantization
# Q4_K_M instead of Q8_0
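To see how much VRAM the loaded model actually uses, query the driver (a minimal sketch using nvidia-smi; the Tesla T4 has roughly 16 GB of VRAM):

import subprocess

# Report used / total VRAM so you can size gpu_layers and ctx_size
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print(out.stdout.strip())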
Server Issues¶
Server won't start¶
Symptom: the local inference server fails to start, often because the default port (8090) is already in use.
Solution:
# Check if port is in use
import socket

sock = socket.socket()
try:
    sock.bind(('127.0.0.1', 8090))
    print("Port 8090 is free")
except OSError:
    print("Port 8090 is in use - trying a different port")
finally:
    sock.close()

# Use a different port
engine = llcuda.InferenceEngine(server_url="http://127.0.0.1:8091")
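If you would rather not guess which port is free, you can let the OS pick one (a sketch: it binds to port 0, reads back the assigned port, and passes it to the engine):

import socket
import llcuda

# Bind to port 0 so the OS assigns any free port
with socket.socket() as sock:
    sock.bind(("127.0.0.1", 0))
    free_port = sock.getsockname()[1]

engine = llcuda.InferenceEngine(server_url=f"http://127.0.0.1:{free_port}")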
Server crashes¶
Symptom: the server starts but crashes or stops responding during model loading or inference.
Solution:
# Run without silent mode to see errors
engine.load_model("model.gguf", silent=False, verbose=True)

# Try reducing memory usage
engine.load_model(
    "model.gguf",
    gpu_layers=20,
    ctx_size=1024
)
Performance Issues¶
Slow inference (<50 tok/s)¶
Solutions:
# 1. Increase GPU offload
engine.load_model("model.gguf", gpu_layers=99)
# 2. Use Q4_K_M quantization
engine.load_model("model-Q4_K_M.gguf")
# 3. Reduce context
engine.load_model("model.gguf", ctx_size=2048)
# 4. Check GPU usage
!nvidia-smi # Should show 80%+ GPU utilization
High latency (>2000ms)¶
Solution:
# Reduce max_tokens
result = engine.infer("Prompt", max_tokens=50)
# Use smaller model (Gemma 3-1B instead of Llama 3.1-8B)
# Optimize parameters
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    gpu_layers=99,
    ctx_size=1024,
    batch_size=512
)
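To see how much of the latency comes from a single request, time one call directly (a rough sketch; it measures wall-clock time for the whole request rather than per-token latency):

import time

start = time.perf_counter()
result = engine.infer("Prompt", max_tokens=50)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Request took {elapsed_ms:.0f} ms")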
Common Error Messages¶
"Binaries not found"¶
# Reinstall with cache clear
pip uninstall llcuda -y
pip cache purge
pip install git+https://github.com/waqasm86/llcuda.git --no-cache-dir
"LD_LIBRARY_PATH not set"¶
import os
from pathlib import Path
# Manually set library path
lib_dir = Path.home() / ".cache" / "llcuda" / "lib"
os.environ["LD_LIBRARY_PATH"] = f"{lib_dir}:{os.environ.get('LD_LIBRARY_PATH', '')}"
"CUDA version mismatch"¶
# Check CUDA version
nvcc --version
nvidia-smi # Look for "CUDA Version"
# llcuda requires CUDA 12.0+
# Google Colab has CUDA 12.2+ by default
Google Colab Specific¶
T4 not available¶
Solution:
- In Colab: Runtime > Change runtime type > GPU > T4
- Free tier: T4 not always available; try again later or use Colab Pro
- Pro tier: T4 guaranteed
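To confirm which accelerator the Colab runtime actually assigned (a minimal check; it only relies on nvidia-smi being on the PATH):

import subprocess

try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    gpu_name = out.stdout.strip()
except FileNotFoundError:
    gpu_name = ""

print(gpu_name or "No GPU assigned")
if "T4" not in gpu_name:
    print("Not a Tesla T4 - llcuda v2.1.0 requires a T4 runtime")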
Runtime disconnects¶
Solution: Keep connection alive with periodic activity or use Colab Pro for longer runtimes.
Debug Mode¶
Enable detailed logging:
import logging
logging.basicConfig(level=logging.DEBUG)
import llcuda
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf", verbose=True, silent=False)
Getting Help¶
- Check error details (see the sketch after this list)
- GitHub Issues: github.com/waqasm86/llcuda/issues
- Include in bug reports:
  - llcuda version (llcuda.__version__)
  - GPU model (nvidia-smi)
  - CUDA version (nvcc --version)
  - Python version (python --version)
  - Full error message
  - Minimal reproducible code
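A minimal way to capture the full error details for a report (a sketch; it simply prints the traceback from a failing call so it can be pasted into the issue):

import traceback
import llcuda

try:
    engine = llcuda.InferenceEngine()
    engine.load_model("model.gguf", verbose=True, silent=False)
    result = engine.infer("Test prompt", max_tokens=10)
except Exception:
    traceback.print_exc()  # Copy this output into the GitHub issue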
Quick Fixes Checklist¶
- GPU is Tesla T4 (check with nvidia-smi)
- CUDA 12.0+ installed (check with nvcc --version)
- Latest llcuda from GitHub (pip install git+https://github.com/waqasm86/llcuda.git)
- Model exists and is accessible
- Port 8090 is available
- Sufficient VRAM for model
- Using Q4_K_M quantization
- gpu_layers=99 for full offload
Next Steps¶
- FAQ - Frequently asked questions
- Performance Optimization - Speed up inference
- First Steps - Getting started guide
- GitHub Issues - Report bugs