Llama.cpp has released its b9113 update, which notably adds support for Q4_1 Mixture of Experts (MoE) models on Adreno GPUs. This enhancement is part of a broader effort to optimize AI model execution across platforms, including macOS, Linux, and Windows. The update also includes code refinements and the removal of unnecessary asserts, leaving the codebase leaner and more maintainable. This release underscores llama.cpp's commitment to expanding its compatibility and efficiency across different hardware environments.
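As a rough sketch of how a Q4_1 MoE model would be prepared and run with llama.cpp's standard tools (the filenames are placeholders, and offload behavior depends on which GPU backend the binary was built with):

```shell
# Quantize an F16 GGUF to the Q4_1 format (illustrative filenames)
./llama-quantize model-f16.gguf model-q4_1.gguf Q4_1

# Run with model layers offloaded to the available GPU backend
./llama-cli -m model-q4_1.gguf -ngl 99 -p "Hello"
```

The `-ngl` flag controls how many layers are offloaded to the GPU; the b9113 change is what allows this path to work for Q4_1 MoE weights on Adreno hardware.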
The b9103 release of llama.cpp continues its trend of broadening platform compatibility, making it a versatile tool for developers across various systems. With this update, Apple Silicon users benefit from KleidiAI support, enhancing performance on M-series Macs. The inclusion of ROCm 7.2 for Ubuntu x64 further narrows the gap between AMD and NVIDIA GPUs, offering more options for local inference. This release doesn't introduce new models but solidifies llama.cpp's position as a go-to runtime for diverse hardware configurations, ensuring developers can deploy AI models efficiently across multiple environments.
The b9105 release of llama.cpp brings a notable improvement by including cuda/iterator directly, which enhances the reliability of CUDA builds. This update moves away from the previous reliance on a transitive include pulled in through cub/cub.cuh, ensuring more stable behavior for developers using NVIDIA GPUs. The release continues to support a broad array of platforms, including macOS with KleidiAI enabled, Linux with ROCm 7.2, and Windows with CUDA 12 and 13. While no new model architectures are introduced, this update reinforces llama.cpp's role as a dependable tool for AI developers working across different hardware environments.
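The nature of the fix can be pictured with an illustrative fragment (not the actual llama.cpp source) showing why depending on a transitive include is fragile:

```cpp
// Before: the iterator utilities were only visible because
// cub/cub.cuh happened to pull in <cuda/iterator> transitively --
// a detail that can change between CCCL releases and break the build.
#include <cub/cub.cuh>

// After: include the header that actually provides the symbols used,
// so the code no longer depends on cub's internal include graph.
#include <cuda/iterator>
```

Including what you use directly is a standard hardening step in C++ codebases: it keeps compilation working even when an upstream library reorganizes its internal headers.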