LlamaCppClient API

OpenAI-compatible client for llama-server.

Overview

The LlamaCppClient provides an OpenAI-compatible interface to llama-server, so code written against the OpenAI chat-completions request/response format can talk to a local llama-server instance.

Basic Usage

from llcuda.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://localhost:8080")

response = client.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
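
Assuming the response follows the OpenAI chat-completion schema (as the 'choices' and 'usage' keys in the class reference below indicate), the generated text can be read from the first choice:

# Read the generated text from the first choice (OpenAI-style response shape)
reply = response["choices"][0]["message"]["content"]
print(reply)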

Class Reference

LlamaCppClient

class LlamaCppClient:
    def __init__(self, base_url: str = "http://localhost:8080"):
        """Initialize client.

        Args:
            base_url: Base URL of llama-server
        """

Methods

create_chat_completion()

def create_chat_completion(
    self,
    messages: List[Dict[str, str]],
    max_tokens: int = 100,
    temperature: float = 0.7,
    top_p: float = 0.9,
    stream: bool = False
) -> Dict:
    """Create chat completion.

    Args:
        messages: List of message dicts with 'role' and 'content'
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        top_p: Nucleus sampling parameter
        stream: Enable streaming

    Returns:
        Response dict with 'choices', 'usage', etc.
    """

Examples

See API Examples for complete, runnable examples.