The HMX Flash Attention update introduces several enhancements, including HMX-accelerated flash attention for prefill and optimizations for multi-threading and memory management. Key changes include replacing inline assembly with Q6 intrinsics so the compiler can analyze and schedule the code, refining the cost model coefficients based on profiling data, and fixing correctness issues in the prefill and softmax paths. Together these changes improve both the performance and the maintainability of the HMX Flash Attention implementation.
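To make the cost-model refinement concrete, below is a minimal sketch of how a profiling-derived linear cost model might gate the HMX path. The structure, coefficient values, and all names (`attn_cost_coeffs`, `use_hmx_flash_attn`) are illustrative assumptions, not taken from the actual patch.

```cpp
// Hypothetical sketch of a profiling-derived cost model of the kind the
// update describes. Coefficients and names are illustrative only.
#include <cstdint>

struct attn_cost_coeffs {
    double fixed_overhead; // per-call setup cost (us), fitted from profiles
    double per_token;      // marginal cost per prefill token (us)
    double per_head;       // marginal cost per attention head (us)
};

// Assumed coefficients for the HMX and scalar paths; real values would be
// refit from profiling runs, as the release notes mention.
static const attn_cost_coeffs HMX_PATH    = {25.0, 0.08, 0.50};
static const attn_cost_coeffs SCALAR_PATH = { 2.0, 0.60, 1.20};

static double estimate_cost(const attn_cost_coeffs &c,
                            int64_t n_tokens, int64_t n_heads) {
    return c.fixed_overhead
         + c.per_token * double(n_tokens)
         + c.per_head  * double(n_heads);
}

// Pick the cheaper path: the HMX path amortizes its setup cost over long
// prefills, while short batches stay on the scalar fallback.
bool use_hmx_flash_attn(int64_t n_tokens, int64_t n_heads) {
    return estimate_cost(HMX_PATH, n_tokens, n_heads) <
           estimate_cost(SCALAR_PATH, n_tokens, n_heads);
}
```

Under coefficients like these, a single-token decode step would stay on the scalar path, while a multi-hundred-token prefill would clear the HMX setup cost and take the accelerated path.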
The b9002 version of llama.cpp has been released, supporting multiple platforms.
The b9004 release of llama.cpp provides builds for multiple platforms, including macOS, Linux, Android, and Windows.
The latest update to llama.cpp adds optimizations for Mixture-of-Experts (MoE) models on Adreno GPUs, along with various fixes across platforms.