Building CUDA Binaries for llcuda¶
This tutorial shows how to build optimized CUDA binaries compatible with llcuda v2.1.0+ on Tesla T4 GPUs. The binaries are built from llama.cpp with FlashAttention and CUDA 12 support.
Advanced Topic
This guide is for advanced users who want to customize binaries or contribute to llcuda development. Regular users should use the pre-built binaries that auto-download from GitHub Releases.
Overview¶
llcuda v2.1.0 uses pre-built v2.0.6 CUDA binaries based on llama.cpp with these optimizations:
Binary Compatibility
The v2.0.6 binaries are fully compatible with llcuda v2.1.0 and later versions. The binary format remains stable while the Python API receives new features.
- FlashAttention - 2-3x faster attention for long contexts
- CUDA Graphs - Reduced kernel launch overhead
- Tensor Cores - INT4/INT8 hardware acceleration (SM 7.5)
- cuBLAS - Optimized matrix multiplication
- SM 7.5 targeting - Tesla T4 specific optimizations
Prerequisites¶
System Requirements¶
- GPU: Tesla T4 (SM 7.5) - Google Colab recommended
- CUDA: 12.0 or higher (12.4 recommended)
- GCC: 11.x or 12.x
- CMake: 3.18+
- Disk Space: 5 GB for build artifacts
- RAM: 8 GB minimum
- Time: 20-30 minutes on T4
Software Dependencies¶
# Update system
sudo apt-get update
sudo apt-get install -y build-essential cmake git wget
# Verify CUDA installation
nvcc --version
nvidia-smi
# Verify GCC version
gcc --version # Should be 11.x or 12.x
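Before starting a long build, it is worth confirming that the toolchain and GPU match the requirements above. This is a quick sanity check; the compute_cap query field needs a reasonably recent NVIDIA driver, and nvidia-smi alone will still show the GPU model if it is unavailable.
# Confirm CMake meets the 3.18+ requirement
cmake --version
# Confirm the GPU and its compute capability (a T4 should report 7.5)
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader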
Build Process¶
Method 1: Using Google Colab Notebook¶
The easiest way to build binaries is using the provided Colab notebook:
1. Open the build notebook (build_llcuda_v2_t4_colab.ipynb, linked under Resources below).
2. Select a T4 GPU runtime: Runtime > Change runtime type > GPU > T4.
3. Run all cells: Runtime > Run all.
4. Download the built binaries: the archive will be at /content/llcuda-binaries-cuda12-t4-v2.0.6.tar.gz.
Method 2: Manual Build¶
For local systems with Tesla T4:
Step 1: Clone llama.cpp¶
# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Checkout stable commit (optional but recommended)
git checkout b1698 # Replace with known-good commit
Step 2: Configure CMake for T4¶
# Create build directory
mkdir build
cd build
# Configure with CUDA 12 and FlashAttention
cmake .. \
-DLLAMA_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_CUDA_GRAPHS=ON \
-DGGML_CUDA_DMMV_F16=ON \
-DGGML_CUDA_FORCE_MMQ=OFF \
-DGGML_CUDA_FORCE_CUBLAS=ON \
-DCMAKE_CUDA_ARCHITECTURES=75 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
Key CMake flags explained:
| Flag | Purpose |
|---|---|
| LLAMA_CUDA=ON | Enable CUDA support |
| GGML_CUDA_FA_ALL_QUANTS=ON | FlashAttention for all quantization types |
| GGML_CUDA_GRAPHS=ON | CUDA graphs for reduced overhead |
| CMAKE_CUDA_ARCHITECTURES=75 | Target Tesla T4 (SM 7.5) |
| GGML_CUDA_FORCE_CUBLAS=ON | Use cuBLAS for matrix operations |
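To double-check that these options were actually picked up, you can grep the generated CMake cache in the build directory. Treat this as a quick sanity check; variable names can shift between llama.cpp versions.
# From the build directory: confirm the CUDA-related options landed in the cache
grep -E "LLAMA_CUDA|GGML_CUDA|CUDA_ARCHITECTURES" CMakeCache.txt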
Step 3: Build Binaries¶
# Build with parallel jobs
cmake --build . --config Release -j$(nproc)
# Expected output:
# [100%] Built target llama-server
# [100%] Built target llama-cli
# [100%] Built target llama-quantize
# [100%] Built target llama-embedding
Build time: ~20-25 minutes on T4
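If you expect to iterate on the build, ccache can speed up rebuilds. This is an optional addition, not part of the reference configuration, and the launchers below only cache the host C/C++ compilation, not the nvcc device compilation.
# Optional: cache host C/C++ compilation for faster rebuilds
sudo apt-get install -y ccache
cmake .. -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build . --config Release -j$(nproc)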
Step 4: Verify Built Binaries¶
# Check binaries exist
ls -lh bin/
# Expected files:
# llama-server (~6.5 MB)
# llama-cli (~4.2 MB)
# llama-quantize (~434 KB)
# llama-embedding (~3.3 MB)
# llama-bench (~581 KB)
# Check library files
ls -lh src/
# Expected libraries:
# libggml-cuda.so (~221 MB)
# libllama.so (~15 MB)
# libggml-base.so (~8 MB)
# libggml-cpu.so (~6 MB)
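It also helps to confirm at this point that the binaries resolve their CUDA dependencies against the freshly built libraries. This is a quick check, assuming the libraries landed under src/ as listed above.
# Point the loader at the build's libraries and confirm CUDA dependencies resolve
export LD_LIBRARY_PATH=$PWD/src:$LD_LIBRARY_PATH
ldd bin/llama-server | grep -iE "cuda|cublas"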
Step 5: Test Binary¶
# Quick test (should show CUDA support)
./bin/llama-server --version
# Expected output:
# llama-server: built with CUDA 12.4 for compute capability 7.5
# FlashAttention: enabled
# CUDA graphs: enabled
Method 3: Using Build Script¶
llcuda includes a build script for automation:
# Clone llcuda repository
git clone https://github.com/waqasm86/llcuda.git
cd llcuda/scripts
# Run build script
chmod +x build_t4_binaries.sh
./build_t4_binaries.sh
# Binaries will be in ../build-artifacts/
Creating Distribution Package¶
After building, create a distribution package:
Step 1: Organize Files¶
# Create package directory structure
mkdir -p llcuda-binaries-cuda12-t4/bin
mkdir -p llcuda-binaries-cuda12-t4/lib
# Copy binaries (paths assume the package directory was created alongside llama.cpp)
cd llama.cpp/build
cp bin/llama-server ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-cli ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-quantize ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-embedding ../../llcuda-binaries-cuda12-t4/bin/
cp bin/llama-bench ../../llcuda-binaries-cuda12-t4/bin/
# Copy libraries
cp src/libggml-cuda.so ../../llcuda-binaries-cuda12-t4/lib/
cp src/libllama.so ../../llcuda-binaries-cuda12-t4/lib/
cp src/libggml-base.so ../../llcuda-binaries-cuda12-t4/lib/
cp src/libggml-cpu.so ../../llcuda-binaries-cuda12-t4/lib/
Step 2: Create Tarball¶
# Create compressed archive
cd ../..
tar -czf llcuda-binaries-cuda12-t4-v2.0.6.tar.gz llcuda-binaries-cuda12-t4/
# Verify size (~266 MB)
ls -lh llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
# Calculate SHA256 checksum
sha256sum llcuda-binaries-cuda12-t4-v2.0.6.tar.gz > llcuda-binaries-cuda12-t4-v2.0.6.tar.gz.sha256
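On the machine that later downloads the package, the checksum file can be used to verify the archive before extracting it (the archive and the .sha256 file must sit in the same directory):
# Verify the downloaded archive against the recorded checksum
sha256sum -c llcuda-binaries-cuda12-t4-v2.0.6.tar.gz.sha256
# Expected: llcuda-binaries-cuda12-t4-v2.0.6.tar.gz: OK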
Step 3: Test Package¶
# Extract to test
mkdir test-extract
cd test-extract
tar -xzf ../llcuda-binaries-cuda12-t4-v2.0.6.tar.gz
# Test llama-server
export LD_LIBRARY_PATH=llcuda-binaries-cuda12-t4/lib:$LD_LIBRARY_PATH
./llcuda-binaries-cuda12-t4/bin/llama-server --version
# Should show CUDA 12.x, SM 7.5, FlashAttention enabled
Build Configuration Options¶
FlashAttention Variants¶
# FlashAttention for all quantizations (default)
-DGGML_CUDA_FA_ALL_QUANTS=ON
# FlashAttention for FP16 only (smaller binary)
-DGGML_CUDA_FA_ALL_QUANTS=OFF
# Disable FlashAttention (not recommended)
# Remove -DGGML_CUDA_FA_ALL_QUANTS flag
Compute Capability Targeting¶
# Tesla T4 only (SM 7.5)
-DCMAKE_CUDA_ARCHITECTURES=75
# Multiple architectures (larger binary)
-DCMAKE_CUDA_ARCHITECTURES="70;75;80;86"
# All modern architectures
-DCMAKE_CUDA_ARCHITECTURES="70;75;80;86;89;90"
Memory Optimizations¶
# Use FP16 for DMMV (faster, more memory)
-DGGML_CUDA_DMMV_F16=ON
# Don't force the custom quantized matmul (MMQ) kernels
-DGGML_CUDA_FORCE_MMQ=OFF
# Use cuBLAS for better performance
-DGGML_CUDA_FORCE_CUBLAS=ON
Verifying Build Quality¶
Check CUDA Features¶
# Run llama-server with --help
./bin/llama-server --help | grep -i cuda
# Should show:
# --flash-attn enable FlashAttention
# --cuda-graphs use CUDA graphs for better performance
Performance Test¶
# Download a small test model
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# Run benchmark
./bin/llama-bench \
-m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-ngl 99 \
-p 512 \
-n 128
# Expected output should show:
# - GPU layers: 99
# - Tokens/sec > 200
# - Using CUDA compute 7.5
Memory Bandwidth Test¶
# Check if using Tensor Cores
./bin/llama-server \
-m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-ngl 99 \
-v
# Look for log output:
# "using Tensor Cores"
# "FlashAttention enabled"
Troubleshooting Build Issues¶
CMake Can't Find CUDA¶
# Set CUDA path explicitly
export CUDA_PATH=/usr/local/cuda-12.4
export PATH=$CUDA_PATH/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
# Retry CMake
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=$CUDA_PATH/bin/nvcc
GCC Version Mismatch¶
# CUDA 12.x requires GCC 11 or 12
sudo apt-get install gcc-11 g++-11
# Use specific GCC version
cmake .. -DLLAMA_CUDA=ON -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11
Out of Memory During Build¶
# Reduce parallel jobs
cmake --build . --config Release -j2
# Or build sequentially
cmake --build . --config Release -j1
Missing libcudart.so¶
# Add CUDA lib to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Verify
ldd bin/llama-server | grep cuda
Contributing Built Binaries¶
If you build optimized binaries for different configurations:
1. Test thoroughly:
   - Run on a clean Colab instance
   - Verify all models load correctly
   - Benchmark performance
2. Create a GitHub issue (a snippet for collecting the relevant environment details follows this list):
   - Describe the build configuration
   - Share the build flags used
   - Include benchmark results
3. Upload to a GitHub Release:
   - Fork the llcuda repository
   - Create a pull request with build documentation
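As a convenience when filing the issue, the environment details can be collected into a single file to attach (build-info.txt is just an illustrative name):
# Collect toolchain and GPU details to attach to the issue or PR
{
  nvcc --version
  gcc --version | head -n1
  cmake --version | head -n1
  nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
} > build-info.txt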
Next Steps¶
- Performance Optimization - Tune runtime parameters
- Benchmarks - Compare performance
- Troubleshooting - Fix common issues
- Unsloth Integration - Deploy fine-tuned models
Resources¶
- Build Notebook: build_llcuda_v2_t4_colab.ipynb
- llama.cpp: GitHub
- CUDA Toolkit: Download
- CMake Documentation: cmake.org
Reference Build Configuration¶
For reproducible builds, here's the exact configuration used for llcuda v2.0.6 binaries (compatible with v2.1.0+):
cmake .. \
-DLLAMA_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_CUDA_GRAPHS=ON \
-DGGML_CUDA_DMMV_F16=ON \
-DGGML_CUDA_FORCE_MMQ=OFF \
-DGGML_CUDA_FORCE_CUBLAS=ON \
-DCMAKE_CUDA_ARCHITECTURES=75 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc-11 \
-DCMAKE_CXX_COMPILER=g++-11 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build . --config Release -j$(nproc)
Environment:
- CUDA: 12.4
- GCC: 11.4
- CMake: 3.27
- llama.cpp: commit b1698
- GPU: Tesla T4