# FlashAttention
FlashAttention v2 optimization in llcuda.
## What is FlashAttention?
A memory-efficient attention algorithm:

- 2-3x faster than standard attention
- Lower memory usage
- Exact (not approximate)
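For reference, FlashAttention produces the same output as standard scaled dot-product attention; it simply computes it block by block with an online softmax, so the full attention matrix is never materialized in GPU memory:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

Because the tiling only changes the order of computation, the result is exact rather than an approximation.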
## Enable in llcuda
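A minimal launch sketch, assuming llcuda forwards standard llama.cpp runtime flags (the actual llcuda entrypoint, flag names, and model path are illustrative assumptions):

```bash
# Assumed flag passthrough to the underlying llama.cpp runtime;
# the model path is only an example.
llcuda --model ./models/model-7b-Q4_K_M.gguf --flash-attn
```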
## Supported
- ✅ All quantization types
- ✅ All context sizes
- ✅ Both GPUs (tensor-split)
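On a dual-GPU machine, the flag combines with tensor-split in the usual llama.cpp style (again assuming llcuda forwards these flags unchanged):

```bash
# Assumed passthrough: split the model evenly across two GPUs
# and enable FlashAttention.
llcuda --model ./models/model-13b-Q4_K_M.gguf \
       --tensor-split 1,1 \
       --flash-attn
```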
## Performance Impact
| Model | Without FA | With FA | Speedup |
|---|---|---|---|
| 7B | ~15 tok/s | ~35 tok/s | 2.3x |
| 13B | ~8 tok/s | ~18 tok/s | 2.3x |
| 70B | ~5 tok/s | ~12 tok/s | 2.4x |
## Requirements
- SM 7.5+ (Tesla T4 ✅)
- CUDA 12.x
- Built with `-DGGML_CUDA_FA_ALL_QUANTS=ON`
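A build sketch, assuming llcuda is compiled through the standard llama.cpp/GGML CMake options (the build directory and other options are illustrative):

```bash
# Configure with CUDA enabled and FlashAttention kernels built
# for all quantization types, then build in Release mode.
cmake -B build \
      -DGGML_CUDA=ON \
      -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
```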