The HMX Flash Attention update introduces several enhancements, including HMX-accelerated flash attention for prefill and optimizations for multi-threading and memory management. Key changes include replacing inline assembly with Q6 intrinsics so the compiler can analyze and schedule the code, refining the cost model coefficients based on profiling data, and fixing correctness issues in the prefill and softmax paths. Together these changes improve both the performance and the maintainability of the HMX Flash Attention implementation.
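To make the cost-model refinement concrete, below is a minimal sketch of how a profiling-derived linear cost model might gate the HMX path. The structure, coefficient values, and all names (`attn_cost_coeffs`, `use_hmx_flash_attn`) are illustrative assumptions, not taken from the actual patch.

```cpp
// Hypothetical sketch of a profiling-derived cost model of the kind the
// update describes. Coefficients and names are illustrative only.
#include <cstdint>

struct attn_cost_coeffs {
    double fixed_overhead; // per-call setup cost (us), fitted from profiles
    double per_token;      // marginal cost per prefill token (us)
    double per_head;       // marginal cost per attention head (us)
};

// Assumed coefficients for the HMX and scalar paths; real values would be
// refit from profiling runs, as the release notes mention.
static const attn_cost_coeffs HMX_PATH    = {25.0, 0.08, 0.50};
static const attn_cost_coeffs SCALAR_PATH = { 2.0, 0.60, 1.20};

static double estimate_cost(const attn_cost_coeffs &c,
                            int64_t n_tokens, int64_t n_heads) {
    return c.fixed_overhead
         + c.per_token * double(n_tokens)
         + c.per_head  * double(n_heads);
}

// Pick the cheaper path: the HMX path amortizes its setup cost over long
// prefills, while short batches stay on the scalar fallback.
bool use_hmx_flash_attn(int64_t n_tokens, int64_t n_heads) {
    return estimate_cost(HMX_PATH, n_tokens, n_heads) <
           estimate_cost(SCALAR_PATH, n_tokens, n_heads);
}
```

Under coefficients like these, a single-token decode step would stay on the scalar path, while a multi-hundred-token prefill would clear the HMX setup cost and take the accelerated path.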
The b9002 version of llama.cpp has been released, supporting multiple platforms.
The b9004 release of llama.cpp provides builds for multiple platforms, including macOS, Linux, Android, and Windows.
The latest update to llama.cpp adds optimizations for Mixture-of-Experts (MoE) models on Adreno GPUs, along with various fixes across platforms.