Troubleshooting Guide

Solutions to common issues with llcuda v2.1.0 on Tesla T4 GPUs.

Installation Issues

pip install fails

Symptom:

ERROR: Could not find a version that satisfies the requirement llcuda

Solution:

# Install from GitHub (not PyPI for v2.1.0)
pip install git+https://github.com/waqasm86/llcuda.git

# Or use specific release
pip install https://github.com/waqasm86/llcuda/releases/download/v2.1.0/llcuda-2.1.0-py3-none-any.whl
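
After installing, a quick import check confirms the package is importable and shows which version you got (llcuda.__version__ is the same attribute referenced under "Getting Help" below):

# Verify the installation
import llcuda
print(llcuda.__version__)  # expect 2.1.0 for the v2.1.0 release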

Binary download fails

Symptom:

Failed to download CUDA binaries: HTTP 404

Solution:

# Manually download binaries
import requests
import tarfile
from pathlib import Path

url = "https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz"
cache_dir = Path.home() / ".cache" / "llcuda"
cache_dir.mkdir(parents=True, exist_ok=True)

# Download
response = requests.get(url)
response.raise_for_status()  # fail early on HTTP 404 instead of writing an error page to disk
tar_path = cache_dir / "binaries.tar.gz"
tar_path.write_bytes(response.content)

# Extract
with tarfile.open(tar_path, 'r:gz') as tar:
    tar.extractall(cache_dir)
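
Continuing the snippet above (cache_dir is still defined), a quick sanity check is to look for the llama-server binary in the extracted tree; the exact archive layout may differ between releases:

# Sanity check: the extracted archive should contain the llama-server binary
# (its exact location inside the tree may vary between releases)
found = list(cache_dir.rglob("llama-server"))
if found:
    print("llama-server found at:", found[0])
else:
    print("llama-server not found - inspect the archive contents")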

GPU Issues

GPU not detected

Symptom:

CUDA not available
No CUDA GPU detected

Solution:

# Check NVIDIA driver
nvidia-smi

# If fails in Colab, verify runtime type
# Runtime > Change runtime type > GPU > T4

# Verify CUDA version
nvcc --version  # Should show CUDA 12.x

Wrong GPU detected

Symptom:

Your GPU is not Tesla T4
GPU: Tesla P100 (SM 6.0)

Solution: llcuda v2.1.0 is Tesla T4-only. For other GPUs, use v1.2.2:

pip install llcuda==1.2.2
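
To see which line applies to your hardware, you can query the GPU name from Python before installing (a minimal sketch; it assumes nvidia-smi is on PATH):

# Query the GPU model: Tesla T4 -> llcuda v2.1.0, anything else -> llcuda==1.2.2
import subprocess

gpu_name = subprocess.run(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(gpu_name)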

Model Loading Issues

Model not found

Symptom:

FileNotFoundError: Model file not found: gemma-3-1b-Q4_K_M

Solution:

# Use full HuggingFace path
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"
)

# Or download manually
from llcuda.models import download_model
model_path = download_model(
    "unsloth/gemma-3-1b-it-GGUF",
    "gemma-3-1b-it-Q4_K_M.gguf"
)

Out of memory

Symptom:

CUDA out of memory
Failed to allocate tensor

Solution:

# Reduce GPU layers
engine.load_model("model.gguf", gpu_layers=20)

# Reduce context size
engine.load_model("model.gguf", ctx_size=1024)

# Use smaller quantization
# Q4_K_M instead of Q8_0
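
Before reloading, it helps to know how much VRAM is actually free on the T4 (16 GB total). A minimal check, assuming nvidia-smi is on PATH:

# Check free VRAM before choosing gpu_layers / ctx_size
import subprocess

free_mib = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"Free VRAM: {free_mib} MiB")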

Server Issues

Server won't start

Symptom:

RuntimeError: Failed to start llama-server

Solution:

# Check if port is in use
import socket

sock = socket.socket()
try:
    sock.bind(('127.0.0.1', 8090))
    print("Port 8090 is free")
except OSError:
    print("Port 8090 is in use - trying a different port")
finally:
    sock.close()

# Use different port
engine = llcuda.InferenceEngine(server_url="http://127.0.0.1:8091")
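
If you would rather not probe ports by hand, you can let the OS pick a free one and pass it to the engine. A sketch using the same server_url parameter shown above (note the small race window: the port is released just before the server binds it):

# Ask the OS for a free ephemeral port, then point the engine at it
import socket
import llcuda

with socket.socket() as s:
    s.bind(("127.0.0.1", 0))      # port 0 = let the OS choose
    free_port = s.getsockname()[1]

engine = llcuda.InferenceEngine(server_url=f"http://127.0.0.1:{free_port}")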

Server crashes

Symptom:

llama-server process died unexpectedly

Solution:

# Run without silent mode to see errors
engine.load_model("model.gguf", silent=False, verbose=True)

# Try reducing memory usage
engine.load_model(
    "model.gguf",
    gpu_layers=20,
    ctx_size=1024
)

Performance Issues

Slow inference (<50 tok/s)

Solutions:

# 1. Increase GPU offload
engine.load_model("model.gguf", gpu_layers=99)

# 2. Use Q4_K_M quantization
engine.load_model("model-Q4_K_M.gguf")

# 3. Reduce context
engine.load_model("model.gguf", ctx_size=2048)

# 4. Check GPU usage
!nvidia-smi  # Should show 80%+ GPU utilization
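
To read utilization programmatically rather than from raw nvidia-smi output, a small sketch (assumes nvidia-smi is on PATH; run it while a generation is in progress):

# Sample GPU utilization and memory use during generation
import subprocess

stats = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(stats)  # e.g. "85 %, 3500 MiB" - aim for 80%+ utilization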

High latency (>2000ms)

Solution:

# Reduce max_tokens
result = engine.infer("Prompt", max_tokens=50)

# Use smaller model (Gemma 3-1B instead of Llama 3.1-8B)

# Optimize parameters
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    gpu_layers=99,
    ctx_size=1024,
    batch_size=512
)
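
To confirm the effect of these settings, time a single request end to end. A rough sketch (it assumes the engine above is loaded; the tok/s figure is an upper bound because max_tokens may not be fully generated):

# Rough latency / throughput check for one request
import time

start = time.perf_counter()
result = engine.infer("Explain GPUs in one sentence.", max_tokens=50)
elapsed = time.perf_counter() - start

print(f"Latency: {elapsed * 1000:.0f} ms")
print(f"~{50 / elapsed:.1f} tok/s (upper bound, assumes all 50 tokens were generated)")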

Common Error Messages

"Binaries not found"

# Reinstall with cache clear
pip uninstall llcuda -y
pip cache purge
pip install git+https://github.com/waqasm86/llcuda.git --no-cache-dir

"LD_LIBRARY_PATH not set"

import os
from pathlib import Path

# Manually set library path
lib_dir = Path.home() / ".cache" / "llcuda" / "lib"
os.environ["LD_LIBRARY_PATH"] = f"{lib_dir}:{os.environ.get('LD_LIBRARY_PATH', '')}"
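
You can also confirm the directory actually contains the shared libraries before relying on it (a quick check; the path below is the default cache location used above):

# Confirm the cache actually holds the CUDA shared libraries
from pathlib import Path

lib_dir = Path.home() / ".cache" / "llcuda" / "lib"
if lib_dir.is_dir():
    libs = sorted(p.name for p in lib_dir.glob("*.so*"))
    print(f"{len(libs)} shared libraries found, e.g. {libs[:5]}")
else:
    print(f"{lib_dir} does not exist - reinstall llcuda or re-download the binaries")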

"CUDA version mismatch"

# Check CUDA version
nvcc --version
nvidia-smi  # Look for "CUDA Version"

# llcuda requires CUDA 12.0+
# Google Colab has CUDA 12.2+ by default
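
A small sketch to read the toolkit version from Python and compare it against the 12.0 requirement (assumes nvcc is on PATH):

# Parse the CUDA toolkit version reported by nvcc and compare against 12.0
import re
import subprocess

out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
match = re.search(r"release (\d+)\.(\d+)", out)
if match:
    version = (int(match.group(1)), int(match.group(2)))
    print("OK" if version >= (12, 0) else f"CUDA {version[0]}.{version[1]} is too old - need 12.0+")
else:
    print("Could not parse nvcc output - is the CUDA toolkit installed?")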

Google Colab Specific

T4 not available

Solution:

  • In Colab: Runtime > Change runtime type > GPU > T4
  • Free tier: T4 not always available, try later or use Colab Pro
  • Pro tier: T4 guaranteed

Runtime disconnects

Solution: Keep connection alive with periodic activity or use Colab Pro for longer runtimes.

Debug Mode

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)

import llcuda
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf", verbose=True, silent=False)
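
To capture the same output for a bug report, the debug log can also be routed to a file (a minimal sketch; the filename is arbitrary, and force=True replaces any handlers configured earlier in the session):

# Write debug output to a file so it can be attached to a GitHub issue
import logging

logging.basicConfig(
    level=logging.DEBUG,
    filename="llcuda_debug.log",  # arbitrary path
    filemode="w",
    force=True,                   # replace handlers set by an earlier basicConfig call
)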

Getting Help

  1. Check error details:

    result = engine.infer("test", max_tokens=10)
    if not result.success:
        print(f"Error: {result.error_message}")
    

  2. GitHub Issues: github.com/waqasm86/llcuda/issues

  3. Include in bug reports:

     • llcuda version (llcuda.__version__)
     • GPU model (nvidia-smi)
     • CUDA version (nvcc --version)
     • Python version (python --version)
     • Full error message
     • Minimal reproducible code

Quick Fixes Checklist

  • GPU is Tesla T4 (check with nvidia-smi)
  • CUDA 12.0+ installed (check with nvcc --version)
  • Latest llcuda from GitHub (pip install git+https://github.com/waqasm86/llcuda.git)
  • Model exists and is accessible
  • Port 8090 is available
  • Sufficient VRAM for model
  • Using Q4_K_M quantization
  • gpu_layers=99 for full offload
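
The automatable items in this checklist can be run as a single script. A minimal diagnostic sketch under the same assumptions (nvidia-smi and nvcc on PATH, llcuda installed from GitHub):

# One-shot diagnostic covering the checklist items that can be automated
import shutil
import socket
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

# GPU model and free VRAM
if shutil.which("nvidia-smi"):
    print("GPU:", run(["nvidia-smi", "--query-gpu=name,memory.free", "--format=csv,noheader"]))
else:
    print("GPU: nvidia-smi not found - is the NVIDIA driver installed?")

# CUDA toolkit version
if shutil.which("nvcc"):
    release = [line for line in run(["nvcc", "--version"]).splitlines() if "release" in line]
    print("CUDA:", release[0].strip() if release else "version line not found")
else:
    print("CUDA: nvcc not found")

# Port 8090 availability
with socket.socket() as s:
    try:
        s.bind(("127.0.0.1", 8090))
        print("Port 8090: free")
    except OSError:
        print("Port 8090: in use")

# llcuda itself
try:
    import llcuda
    print("llcuda:", llcuda.__version__)
except ImportError:
    print("llcuda: not installed")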

Next Steps