Frequently Asked Questions¶
Common questions and answers about llcuda v2.1.0.
General Questions¶
What is llcuda?¶
llcuda is a Python library for fast LLM inference on NVIDIA GPUs, specifically optimized for Tesla T4. It provides:
- Pre-built CUDA binaries with FlashAttention
- One-step installation from GitHub
- 134 tokens/sec on Gemma 3-1B (verified)
- Simple Python API for inference
- Auto-downloading of models and binaries
Why Tesla T4 only?¶
llcuda v2.1.0 is optimized exclusively for Tesla T4 (compute capability 7.5) to maximize performance:
- Tensor Core optimizations for SM 7.5
- FlashAttention tuned for Turing architecture
- Binary size reduction (266 MB vs 500+ MB for multi-GPU)
- Guaranteed compatibility
For other GPUs, use llcuda v1.2.2 which supports SM 5.0-8.9.
How does llcuda compare to other solutions?¶
| Solution | Speed (Gemma 3-1B) | Setup | Ease of Use |
|---|---|---|---|
| llcuda v2.1.0 | 134 tok/s | 1 min | Excellent |
| transformers | 45 tok/s | 5 min | Good |
| vLLM | 85 tok/s | 10 min | Moderate |
| llama.cpp CLI | 128 tok/s | 15 min | Moderate |
On this benchmark, llcuda is roughly 3x faster than transformers (PyTorch) and the quickest of the four to set up.
Installation¶
How do I install llcuda?¶
Install directly from GitHub:

pip install git+https://github.com/waqasm86/llcuda.git

CUDA binaries auto-download on first import (~266 MB).
Do I need to install CUDA Toolkit?¶
No! llcuda includes all necessary CUDA binaries. You only need:
- NVIDIA driver (pre-installed in Google Colab)
- CUDA runtime (pre-installed in Colab)
- Python 3.11+
Can I install from PyPI?¶
Not yet. llcuda v2.1.0 is GitHub-only for now. Use:

pip install git+https://github.com/waqasm86/llcuda.git
Why do binaries download on first import?¶
To keep the pip package small (~62 KB), CUDA binaries (266 MB) download automatically on first import from GitHub Releases. This is a one-time download, then cached locally.
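Nothing extra is needed beyond importing the package:

import llcuda  # first import in a fresh environment downloads the CUDA binaries (~266 MB)
# later imports reuse the locally cached copy and start immediately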
Compatibility¶
Which GPUs are supported?¶
llcuda v2.1.0: Tesla T4 only (SM 7.5)
llcuda v1.2.2: All GPUs with SM 5.0+ (Maxwell through Ada Lovelace)
Can I use llcuda on CPU?¶
Yes, but not recommended. Set gpu_layers=0 for CPU mode. Performance drops from 134 tok/s to ~8 tok/s.
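For example, using the same load_model call shown throughout these docs:

# CPU-only mode: offload zero layers to the GPU (expect ~8 tok/s instead of 134)
engine.load_model("gemma-3-1b-Q4_K_M", gpu_layers=0)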
Does llcuda work on Windows?¶
llcuda v2.1.0 is Linux-only (Google Colab, Ubuntu). For Windows, compile from source or use WSL2.
What Python versions are supported?¶
Python 3.11+ is required. Tested on Python 3.11 and 3.12.
What CUDA versions are supported?¶
CUDA 12.0+ required. Tested with CUDA 12.2, 12.4.
Models¶
Which models can I use?¶
Any GGUF model compatible with llama.cpp:
- Gemma (1B, 2B, 3B, 7B)
- Llama (3.1, 3.2, 3.3)
- Qwen (1.5B, 7B, 14B)
- Mistral (7B, 8x7B)
- Phi (2, 3)
What quantization should I use?¶
Q4_K_M for best performance/quality balance on T4:
- Speed: 134 tok/s
- VRAM: 1.2 GB (Gemma 3-1B)
- Quality: < 1% degradation
Other options:
- Q5_K_M: Better quality, 18% slower
- Q8_0: Best quality, 44% slower
How do I load a model from HuggingFace?¶
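llcuda's built-in model names (like "gemma-3-1b-Q4_K_M") download automatically. For other Hugging Face models, one approach is to fetch the GGUF yourself with huggingface_hub and pass the local path to load_model. This is a sketch: the repo and filename below are placeholders, and the download step is not an llcuda feature.

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Placeholder repo/filename -- substitute the GGUF you actually want
model_path = hf_hub_download(
    repo_id="someuser/some-model-GGUF",
    filename="some-model-Q4_K_M.gguf",
)
engine.load_model(model_path, auto_start=True)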
Can I use my fine-tuned models?¶
Yes! Export to GGUF using Unsloth:
# After fine-tuning with Unsloth
model.save_pretrained_gguf(
    "my-model",
    tokenizer,
    quantization_method="q4_k_m"
)
# Load with llcuda
engine.load_model("my-model-Q4_K_M.gguf")
See Unsloth Integration for details.
Performance¶
What performance can I expect?¶
On Tesla T4 with Q4_K_M quantization:
- Gemma 3-1B: 134 tok/s (verified)
- Llama 3.2-3B: ~48 tok/s (estimated)
- Qwen 2.5-7B: ~21 tok/s (estimated)
- Llama 3.1-8B: ~19 tok/s (estimated)
Why is my inference slow?¶
Common causes:
- Not using T4: Other GPUs need v1.2.2
- Low GPU offload: Set gpu_layers=99
- Wrong quantization: Use Q4_K_M
- Large context: Reduce ctx_size to 2048
- CPU mode: Check that nvidia-smi shows GPU usage
See Troubleshooting for solutions.
How can I optimize performance?¶
# Optimal configuration for T4
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    gpu_layers=99,        # Full GPU offload
    ctx_size=2048,        # Balanced context
    batch_size=512,       # Optimal batch
    ubatch_size=128,
    auto_configure=True   # Let llcuda optimize
)
See Performance Tutorial for details.
Does llcuda support batching?¶
Yes:
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
results = engine.batch_infer(prompts, max_tokens=100)
For concurrent requests, use n_parallel:
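A sketch of the concurrent case; where exactly n_parallel is passed is an assumption here, so treat it as illustrative:

# Assumed placement: request n_parallel server slots when the model is loaded
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    gpu_layers=99,
    n_parallel=4,  # handle up to 4 requests concurrently (hypothetical kwarg placement)
)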
Memory¶
How much VRAM do I need?¶
Depends on model size and quantization:
| Model | Q4_K_M | Q5_K_M | Q8_0 |
|---|---|---|---|
| 1B | 1.2 GB | 1.5 GB | 2.5 GB |
| 3B | 2.0 GB | 2.4 GB | 4.2 GB |
| 7B | 5.0 GB | 6.0 GB | 10 GB |
| 8B | 5.5 GB | 6.5 GB | 11 GB |
Tesla T4 has 16 GB of VRAM (roughly 15 GB usable), sufficient for models up to 7-8B.
Can I run multiple models simultaneously?¶
Yes, on different ports:
# Model 1
engine1 = llcuda.InferenceEngine(server_url="http://127.0.0.1:8090")
engine1.load_model("gemma-3-1b-Q4_K_M")
# Model 2
engine2 = llcuda.InferenceEngine(server_url="http://127.0.0.1:8091")
engine2.load_model("llama-3.2-3b-Q4_K_M")
Watch total VRAM usage with nvidia-smi.
What if I run out of VRAM?¶
- Use smaller model (1B instead of 3B)
- Use Q4_K_M instead of Q8_0
- Reduce gpu_layers (e.g., 20 instead of 99)
- Reduce ctx_size (e.g., 1024 instead of 4096)
- Close other GPU applications
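For example, a lower-memory configuration combining those options:

# Trades speed for VRAM: partial GPU offload plus a smaller context window
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    gpu_layers=20,   # instead of 99
    ctx_size=1024,   # instead of 4096
)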
Usage¶
How do I run inference?¶
import llcuda
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
result = engine.infer("What is AI?", max_tokens=100)
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tok/s")
Can I stream outputs?¶
Yes:
def print_chunk(text):
    print(text, end='', flush=True)

result = engine.infer_stream(
    "Write a story:",
    callback=print_chunk,
    max_tokens=200
)
How do I stop generation early?¶
Use stop_sequences:
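A minimal example; the exact parameter form (a list of strings passed to infer) is an assumption:

# Generation stops as soon as any stop string appears in the output
result = engine.infer(
    "List three colors:\n1.",
    max_tokens=100,
    stop_sequences=["\n\n", "4."],  # assumed to take a list of strings
)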
Can I control randomness?¶
Yes, with temperature and seed:
# Deterministic
result = engine.infer(
    "Prompt",
    temperature=0.1,
    seed=42
)

# Creative
result = engine.infer(
    "Prompt",
    temperature=1.0,
    top_k=100
)
Google Colab¶
Does llcuda work in Google Colab?¶
Yes! llcuda is optimized for Colab T4:
# In Colab
!pip install git+https://github.com/waqasm86/llcuda.git
import llcuda
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
How do I get T4 in Colab?¶
Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4
Do I need Colab Pro?¶
No, but Colab Pro provides:
- Guaranteed T4 access
- Longer runtime (24h vs 12h)
- More RAM
- Priority execution
Free tier works but T4 availability varies.
Can I save models between sessions?¶
Models cache to ~/.cache/llcuda/. In Colab, this cache is wiped when the runtime resets. Use:
# Save to Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Copy model
!cp ~/.cache/llcuda/models/gemma-3-1b*.gguf /content/drive/MyDrive/
# Next session: load from Drive
engine.load_model("/content/drive/MyDrive/gemma-3-1b-Q4_K_M.gguf")
Troubleshooting¶
Import fails with "No module named llcuda"¶
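This usually means the install did not complete, or it ran in a different environment than the one you are importing from. Reinstall from GitHub and try again:

!pip install git+https://github.com/waqasm86/llcuda.git
import llcuda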
Binary download fails¶
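The binaries come from GitHub Releases, so this is usually a transient network issue: check connectivity and re-run the import. If it keeps failing, clearing the local cache (~/.cache/llcuda/) before re-importing may help (an assumption); otherwise see Troubleshooting or open a GitHub Issue.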
Server won't start¶
Check that port 8090 is available, or use a different port:
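For example, reusing the server_url pattern from the multiple-models answer above:

engine = llcuda.InferenceEngine(server_url="http://127.0.0.1:8091")
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)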
Performance is slow¶
See Performance Troubleshooting
Contributing¶
Can I contribute to llcuda?¶
Yes! Contributions welcome:
- Bug reports: GitHub Issues
- Feature requests: Open an issue
- Code: Fork and submit PR
- Documentation: Help improve docs
How do I build binaries?¶
How do I report bugs?¶
Open a GitHub Issue with:
- llcuda version
- GPU model
- CUDA version
- Python version
- Error message
- Minimal reproducible code
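A quick way to gather most of that information (llcuda.__version__ is assumed to exist; everything else comes from the standard library and nvidia-smi):

import sys, subprocess
import llcuda

print("llcuda:", getattr(llcuda, "__version__", "unknown"))  # assumed attribute
print("Python:", sys.version.split()[0])
# nvidia-smi reports the GPU model, driver version, and CUDA runtime version
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)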
Next Steps¶
Still have questions?¶
Ask on GitHub Discussions or open an issue.