The b9129 release of llama.cpp adds an adaptive fallback to the ggml-zendnn backend: small batches are routed to the standard CPU path, where they run faster than through ZenDNN. The feature is enabled by default and can be toggled through a new runtime environment variable. The release ships for macOS, Windows, and Linux, keeping the optimization broadly available across hardware setups.
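The release notes don't name the environment variable, so the following is only a rough sketch of how a batch-size fallback toggle of this kind typically works; `GGML_ZENDNN_MIN_BATCH` and the default threshold of 32 are hypothetical placeholders, not the actual names or values used by the backend.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical illustration of an adaptive-fallback decision.
// GGML_ZENDNN_MIN_BATCH and the default of 32 are placeholders,
// not the real variable or threshold shipped in b9129.
static bool use_zendnn_for_batch(int n_batch) {
    const char * env = std::getenv("GGML_ZENDNN_MIN_BATCH"); // hypothetical name
    if (env && std::strcmp(env, "0") == 0) {
        return true; // fallback disabled: every batch goes through ZenDNN
    }
    const int min_batch = env ? std::atoi(env) : 32; // assumed default threshold
    // Small batches fall back to the plain CPU path; larger ones use ZenDNN.
    return n_batch >= min_batch;
}
```

Under this sketch, setting the variable to `0` would disable the fallback entirely, while raising the threshold would send more batch sizes to the plain CPU path.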
The b9133 release of llama.cpp introduces notable improvements for reasoning models in the server and web UI. By removing the blocking assistant prefill and handling thinking tags directly, the update makes continuation of generation tasks smoother. It also drops the reasoning guard on the Continue button and keeps reasoning content persistent across reloads. For now the update targets templates with simple thinking tags, setting the stage for future enhancements to reasoning model support.
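As a minimal sketch of what handling simple thinking tags involves, the function below splits a model response into reasoning and visible content. The `<think>...</think>` delimiters are an assumption for illustration; the actual tags and parsing depend on the chat template and live in the server and web UI code.

```cpp
#include <string>
#include <utility>

// Minimal sketch: split a response into (reasoning, content) for a
// template with simple thinking tags. The <think>...</think> markers
// are assumed here; real templates may use different delimiters.
static std::pair<std::string, std::string> split_reasoning(const std::string & text) {
    const std::string open  = "<think>";
    const std::string close = "</think>";
    const size_t b = text.find(open);
    const size_t e = text.find(close);
    if (b == std::string::npos || e == std::string::npos || e < b) {
        return {"", text}; // no complete thinking block: everything is content
    }
    std::string reasoning = text.substr(b + open.size(), e - (b + open.size()));
    std::string content   = text.substr(0, b) + text.substr(e + close.size());
    return {reasoning, content};
}
```

Keeping the two halves separate is what lets the UI show reasoning in its own panel and preserve it across reloads instead of discarding it with the prompt.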
The b9134 release of llama.cpp continues to broaden platform coverage. It adds macOS Apple Silicon builds with KleidiAI enabled, expands Vulkan and ROCm 7.2 support on Ubuntu, and updates the CUDA 12 and 13 DLLs shipped for Windows, improving GPU performance there. No new models are introduced, but the release reinforces llama.cpp's position as a flexible inference runtime across diverse hardware configurations.