
API Reference Overview

Complete API documentation for llcuda v2.1.0.

Main Components

llcuda provides a simple, PyTorch-style API for GPU-accelerated LLM inference.

Core Classes

| Class           | Purpose                                         | Documentation |
|-----------------|-------------------------------------------------|---------------|
| InferenceEngine | Main interface for model loading and inference  | Details       |
| InferenceResult | Container for inference results with metrics    | Details       |

Utility Functions

| Function                  | Purpose                     | Documentation |
|---------------------------|-----------------------------|---------------|
| check_gpu_compatibility() | Verify GPU support          | Details       |
| get_device_properties()   | Get GPU device information  | Details       |

🚀 Quick API Reference

Basic Usage

import llcuda

# Create engine
engine = llcuda.InferenceEngine()

# Load model
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

# Run inference
result = engine.infer("What is AI?", max_tokens=100)

# Access results
print(result.text)                    # Generated text
print(result.tokens_per_sec)          # Speed in tokens/sec
print(result.latency_ms)              # Latency in milliseconds
print(result.tokens_generated)        # Number of tokens generated

InferenceEngine Methods

__init__(server_url=None)

Create a new inference engine instance.

Parameters:

- server_url (str, optional): Custom llama-server URL. Default: http://127.0.0.1:8090
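For example, to attach the engine to a llama-server instance that is already running on a different port (a minimal sketch; the URL below is illustrative):

import llcuda

# Reuse an existing llama-server instead of the default http://127.0.0.1:8090
engine = llcuda.InferenceEngine(server_url="http://127.0.0.1:8080")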

load_model(model_path, silent=False, auto_start=True, **kwargs)

Load a GGUF model for inference.

Parameters:

- model_path (str): Model identifier or path
    - HuggingFace: "unsloth/repo-name:filename.gguf"
    - Registry: "gemma-3-1b-Q4_K_M"
    - Local: "/path/to/model.gguf"
- silent (bool): Suppress llama-server output. Default: False
- auto_start (bool): Start server automatically. Default: True
- **kwargs: Additional options (context_size, gpu_layers, etc.)
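A sketch of loading a local GGUF file while passing extra options through **kwargs; the path and option values below are illustrative, not recommendations:

# Local file path and tuning values are placeholders
engine.load_model(
    "/path/to/model.gguf",
    silent=True,
    context_size=4096,   # forwarded via **kwargs; value is illustrative
    gpu_layers=99,       # forwarded via **kwargs; accepted values may differ
)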

infer(prompt, max_tokens=512, temperature=0.7, **kwargs)

Run inference on a single prompt.

Parameters:

- prompt (str): Input text
- max_tokens (int): Maximum tokens to generate. Default: 512
- temperature (float): Sampling temperature. Default: 0.7
- top_p (float): Nucleus sampling threshold. Default: 0.9
- top_k (int): Top-k sampling. Default: 40
- stop_sequences (list): Stop generation at these sequences

Returns:

- InferenceResult: Result object with text and metrics
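A short example combining the sampling parameters listed above (the prompt and values are illustrative):

result = engine.infer(
    "Explain GPU inference in one sentence.",
    max_tokens=64,
    temperature=0.2,
    top_p=0.9,
    top_k=40,
    stop_sequences=["\n\n"],
)
print(result.text)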

batch_infer(prompts, max_tokens=512, **kwargs)

Run inference on multiple prompts.

Parameters:

- prompts (list[str]): List of input texts
- max_tokens (int): Maximum tokens per prompt
- **kwargs: Same as infer()

Returns:

- list[InferenceResult]: List of results
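A sketch of batching several prompts and reading each result (the prompts are illustrative):

prompts = ["What is AI?", "What is a GPU?", "What is a GGUF file?"]
results = engine.batch_infer(prompts, max_tokens=50)

# Results come back in the same order as the prompts
for prompt, result in zip(prompts, results):
    print(f"{prompt!r}: {result.tokens_generated} tokens at {result.tokens_per_sec:.1f} tok/s")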

get_metrics()

Get aggregated performance metrics.

Returns:

- dict: Metrics dictionary with throughput and latency stats
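Because the exact keys of the metrics dictionary are not listed here, a safe pattern is to print whatever it contains:

metrics = engine.get_metrics()

# Key names depend on the llcuda version, so iterate rather than
# indexing specific fields
for name, value in metrics.items():
    print(f"{name}: {value}")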

InferenceResult Attributes

| Attribute        | Type  | Description                            |
|------------------|-------|----------------------------------------|
| text             | str   | Generated text                         |
| tokens_per_sec   | float | Generation speed in tokens per second  |
| latency_ms       | float | Total latency in milliseconds          |
| tokens_generated | int   | Number of tokens generated             |

Utility Functions

check_gpu_compatibility()

Check if current GPU is compatible with llcuda.

Returns:

{
    'gpu_name': str,          # e.g., "Tesla T4"
    'compute_capability': str, # e.g., "7.5"
    'compatible': bool,       # True if supported
    'platform': str          # e.g., "colab", "local"
}

Example:

compat = llcuda.check_gpu_compatibility()
if compat['compatible']:
    print(f"✅ {compat['gpu_name']} is compatible!")
else:
    print(f"⚠️ {compat['gpu_name']} may not work")
