
Building CUDA Binaries for llcuda

This tutorial shows how to build optimized CUDA binaries compatible with llcuda v2.1.0+ on Tesla T4 GPUs. The binaries are built from llama.cpp with FlashAttention and CUDA 12 support.

Advanced Topic

This guide is for advanced users who want to customize binaries or contribute to llcuda development. Regular users should use the pre-built binaries that auto-download from GitHub Releases.

Overview

llcuda v2.1.0 ships with pre-built v2.0.6 CUDA binaries, compiled from llama.cpp with these optimizations:

Binary Compatibility

The v2.0.6 binaries are fully compatible with llcuda v2.1.0 and later versions. The binary format remains stable while the Python API receives new features.

  • FlashAttention - 2-3x faster attention for long contexts
  • CUDA Graphs - Reduced kernel launch overhead
  • Tensor Cores - INT4/INT8 hardware acceleration (SM 7.5)
  • cuBLAS - Optimized matrix multiplication
  • SM 7.5 targeting - Tesla T4 specific optimizations

Prerequisites

System Requirements

  • GPU: Tesla T4 (SM 7.5) - Google Colab recommended
  • CUDA: 12.0 or higher (12.4 recommended)
  • GCC: 11.x or 12.x
  • CMake: 3.18+
  • Disk Space: 5 GB for build artifacts
  • RAM: 8 GB minimum
  • Time: 20-30 minutes on T4
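
To confirm the GPU requirement before you start, query the device name and compute capability directly. This is a minimal check; the compute_cap field assumes a reasonably recent NVIDIA driver, so fall back to plain nvidia-smi output if the query is rejected.

# Query GPU name, compute capability (SM version), and memory
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader

# A Colab T4 typically reports something like:
# Tesla T4, 7.5, 15360 MiB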

Software Dependencies

# Update system
sudo apt-get update
sudo apt-get install -y build-essential cmake git wget

# Verify CUDA installation
nvcc --version
nvidia-smi

# Verify GCC version
gcc --version  # Should be 11.x or 12.x
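
If your distribution ships a CMake older than 3.18, one option (not required on Colab, which already meets the requirement) is pulling a newer release from PyPI into the build environment:

# Check the installed CMake version first
cmake --version

# If it is older than 3.18, install a newer release via pip
python3 -m pip install --upgrade "cmake>=3.18"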

Build Process

Method 1: Using Google Colab Notebook

The easiest way to build binaries is using the provided Colab notebook:

  1. Open the build notebook: build_llcuda_v2_t4_colab.ipynb

  2. Select the T4 GPU runtime: Runtime > Change runtime type > GPU > T4

  3. Run all cells: Runtime > Run all

  4. Download the built binaries: the archive will be at /content/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
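
Before downloading from Colab, it is worth confirming that the archive was actually produced and recording its checksum; a quick check using the output path above:

# Confirm the archive exists and note its checksum before downloading
ls -lh /content/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
sha256sum /content/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz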

Method 2: Manual Build

For local systems with Tesla T4:

Step 1: Clone llama.cpp

# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Checkout stable commit (optional but recommended)
git checkout b1698  # Replace with known-good commit

Step 2: Configure CMake for T4

# Create build directory
mkdir build
cd build

# Configure with CUDA 12 and FlashAttention
cmake .. \
  -DLLAMA_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_DMMV_F16=ON \
  -DGGML_CUDA_FORCE_MMQ=OFF \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=75 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc

Key CMake flags explained:

  • LLAMA_CUDA=ON - Enable CUDA support
  • GGML_CUDA_FA_ALL_QUANTS=ON - FlashAttention for all quantization types
  • GGML_CUDA_GRAPHS=ON - CUDA graphs for reduced kernel launch overhead
  • CMAKE_CUDA_ARCHITECTURES=75 - Target Tesla T4 (SM 7.5)
  • GGML_CUDA_FORCE_CUBLAS=ON - Use cuBLAS for matrix operations
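
To double-check that the configuration picked these values up, you can list the cached options from the build directory without re-running configure. This is a small sketch; the exact variable names vary between llama.cpp revisions.

# View cached CUDA/GGML options (view only, no reconfigure)
cmake -N -L . | grep -E "CUDA|GGML"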

Step 3: Build Binaries

# Build with parallel jobs
cmake --build . --config Release -j$(nproc)

# Expected output:
# [100%] Built target llama-server
# [100%] Built target llama-cli
# [100%] Built target llama-quantize
# [100%] Built target llama-embedding

Build time: ~20-25 minutes on T4

Step 4: Verify Built Binaries

# Check binaries exist
ls -lh bin/

# Expected files:
# llama-server       (~6.5 MB)
# llama-cli          (~4.2 MB)
# llama-quantize     (~434 KB)
# llama-embedding    (~3.3 MB)
# llama-bench        (~581 KB)

# Check library files
ls -lh src/

# Expected libraries:
# libggml-cuda.so    (~221 MB)
# libllama.so        (~15 MB)
# libggml-base.so    (~8 MB)
# libggml-cpu.so     (~6 MB)

Step 5: Test Binary

# Quick test (should show CUDA support)
./bin/llama-server --version

# Expected output:
# llama-server: built with CUDA 12.4 for compute capability 7.5
# FlashAttention: enabled
# CUDA graphs: enabled
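
It is also worth confirming that the binary resolves its CUDA libraries from the toolkit you built against; a missing entry here usually means LD_LIBRARY_PATH needs adjusting (see Troubleshooting below).

# Confirm the CUDA runtime and cuBLAS resolve correctly
ldd bin/llama-server | grep -E "cudart|cublas"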

Method 3: Using Build Script

llcuda includes a build script for automation:

# Clone llcuda repository
git clone https://github.com/waqasm86/llcuda.git
cd llcuda/scripts

# Run build script
chmod +x build_t4_binaries.sh
./build_t4_binaries.sh

# Binaries will be in ../build-artifacts/

Creating Distribution Package

After building, create a distribution package:

Step 1: Organize Files

# Create package directory structure
mkdir -p llcuda-binaries-cuda12-t4/bin
mkdir -p llcuda-binaries-cuda12-t4/lib

# Copy binaries
cd llama.cpp/build
cp bin/llama-server ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-cli ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-quantize ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-embedding ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-bench ../../llcuda-binaries-cuda12-t4/bin/

# Copy libraries
cp src/libggml-cuda.so ../../llcuda-binaries-cuda12-t4/lib/
cp src/libllama.so ../../llcuda-binaries-cuda12-t4/lib/
cp src/libggml-base.so ../../llcuda-binaries-cuda12-t4/lib/
cp src/libggml-cpu.so ../../llcuda-binaries-cuda12-t4/lib/

Step 2: Create Tarball

# Create compressed archive
cd ../..  # back to the directory that contains llcuda-binaries-cuda12-t4/
tar -czf llcuda-binaries-cuda12-t4-v2.0.6.tar.gz llcuda-binaries-cuda12-t4/

# Verify size (~266 MB)
ls -lh llcuda-binaries-cuda12-t4-v2.0.6.tar.gz

# Calculate SHA256 checksum
sha256sum llcuda-binaries-cuda12-t4-v2.0.6.tar.gz > llcuda-binaries-cuda12-t4-v2.0.6.tar.gz.sha256
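
Anyone who downloads the archive later can verify it against the recorded checksum; run the check from the directory containing both files:

# Verify the archive against the recorded checksum
sha256sum -c llcuda-binaries-cuda12-t4-v2.0.6.tar.gz.sha256

# Expected: llcuda-binaries-cuda12-t4-v2.0.6.tar.gz: OK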

Step 3: Test Package

# Extract to test
mkdir test-extract
cd test-extract
tar -xzf ../llcuda-binaries-cuda12-t4-v2.0.6.tar.gz

# Test llama-server
export LD_LIBRARY_PATH=llcuda-binaries-cuda12-t4/lib:$LD_LIBRARY_PATH
./llcuda-binaries-cuda12-t4/bin/llama-server --version

# Should show CUDA 12.x, SM 7.5, FlashAttention enabled

Build Configuration Options

FlashAttention Variants

# FlashAttention for all quantizations (default)
-DGGML_CUDA_FA_ALL_QUANTS=ON

# FlashAttention for FP16 only (smaller binary)
-DGGML_CUDA_FA_ALL_QUANTS=OFF

# Disable FlashAttention (not recommended)
# Remove -DGGML_CUDA_FA_ALL_QUANTS flag

Compute Capability Targeting

# Tesla T4 only (SM 7.5)
-DCMAKE_CUDA_ARCHITECTURES=75

# Multiple architectures (larger binary)
-DCMAKE_CUDA_ARCHITECTURES="70;75;80;86"

# All modern architectures
-DCMAKE_CUDA_ARCHITECTURES="70;75;80;86;89;90"

Memory Optimizations

# Use FP16 intermediates for dequantize-mul-mat-vec kernels (faster on T4)
-DGGML_CUDA_DMMV_F16=ON

# Do not force the custom MMQ kernels (leave large matrix multiplies to cuBLAS)
-DGGML_CUDA_FORCE_MMQ=OFF

# Use cuBLAS for better performance
-DGGML_CUDA_FORCE_CUBLAS=ON

Verifying Build Quality

Check CUDA Features

# Run llama-server with --help
./bin/llama-server --help | grep -iE "cuda|flash"

# Should show:
# --flash-attn               enable FlashAttention
# --cuda-graphs              use CUDA graphs for better performance

Performance Test

# Download a small test model
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Run benchmark
./bin/llama-bench \
  -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -ngl 99 \
  -p 512 \
  -n 128

# Expected output should show:
# - GPU layers: 99
# - Tokens/sec > 200
# - Using CUDA compute 7.5

Memory Bandwidth Test

# Check if using Tensor Cores
./bin/llama-server \
  -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -ngl 99 \
  -v

# Look for log output:
# "using Tensor Cores"
# "FlashAttention enabled"

Troubleshooting Build Issues

CMake Can't Find CUDA

# Set CUDA path explicitly
export CUDA_PATH=/usr/local/cuda-12.4
export PATH=$CUDA_PATH/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH

# Retry CMake
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=$CUDA_PATH/bin/nvcc

GCC Version Mismatch

# CUDA 12.x requires GCC 11 or 12
sudo apt-get install gcc-11 g++-11

# Use specific GCC version
cmake .. -DLLAMA_CUDA=ON -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11
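
If nvcc still picks up an unsupported host compiler after this, pointing CMake's CUDA host compiler at GCC 11 explicitly is another option (adjust the path for your system):

# Pin the host compiler used by nvcc as well
cmake .. -DLLAMA_CUDA=ON \
  -DCMAKE_C_COMPILER=gcc-11 \
  -DCMAKE_CXX_COMPILER=g++-11 \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11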

Out of Memory During Build

# Reduce parallel jobs
cmake --build . --config Release -j2

# Or build sequentially
cmake --build . --config Release -j1

Missing libcudart.so

# Add CUDA lib to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Verify
ldd bin/llama-server | grep cuda
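
To make the CUDA library path persist across shells, one alternative to exporting LD_LIBRARY_PATH is registering the directory with the dynamic linker (requires root; the path assumes a default CUDA install):

# Register the CUDA library directory system-wide
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig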

Contributing Built Binaries

If you build optimized binaries for different configurations:

  1. Test thoroughly:
     • Run on a clean Colab instance
     • Verify all models load correctly
     • Benchmark performance

  2. Create a GitHub issue:
     • Describe the build configuration
     • Share the build flags used
     • Include benchmark results

  3. Upload to a GitHub Release:
     • Fork the llcuda repository
     • Create a pull request with build documentation

Resources

Reference Build Configuration

For reproducible builds, here's the exact configuration used for llcuda v2.0.6 binaries (compatible with v2.1.0+):

cmake .. \
  -DLLAMA_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_DMMV_F16=ON \
  -DGGML_CUDA_FORCE_MMQ=OFF \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=75 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=gcc-11 \
  -DCMAKE_CXX_COMPILER=g++-11 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc

cmake --build . --config Release -j$(nproc)

Environment:

  • CUDA: 12.4
  • GCC: 11.4
  • CMake: 3.27
  • llama.cpp: commit b1698
  • GPU: Tesla T4