Llama.cpp's b9075 release optimizes CUDA performance by fusing the snake activation function into a single kernel. This change targets audio decoders such as BigVGAN and Vocos, which previously computed the activation through a longer sequence of separate operations. The fused kernel supports the F32, F16, and BF16 data types. The release reflects llama.cpp's ongoing CUDA work, reducing kernel launches and intermediate memory traffic for these audio models.
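For reference, the snake activation (introduced for periodic audio signals and used by BigVGAN-style vocoders) is x + sin²(αx)/α. The sketch below is a minimal NumPy illustration of the math, not llama.cpp's CUDA kernel; the decomposed variant only illustrates the kind of multi-op sequence that kernel fusion replaces, and all names here are ours.

```python
import numpy as np

def snake(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + np.sin(alpha * x) ** 2 / alpha

def snake_decomposed(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Same value via the identity sin^2(t) = (1 - cos(2t)) / 2 --
    # an example of computing one activation as several elementwise ops,
    # each of which would be a separate kernel launch without fusion.
    return x + (1.0 - np.cos(2.0 * alpha * x)) / (2.0 * alpha)

x = np.linspace(-3.0, 3.0, 7, dtype=np.float32)
assert np.allclose(snake(x), snake_decomposed(x), atol=1e-6)
```

Fusing such a chain into one kernel avoids writing the intermediate tensors (scaled input, sine, square) back to global memory between steps.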
The b9073 release of llama.cpp marks a significant expansion in platform compatibility, enhancing its accessibility across various operating systems. With KleidiAI now enabled for macOS Apple Silicon, M-series Mac users can expect improved performance. The update also includes builds for Ubuntu featuring ROCm 7.2 and OpenVINO, alongside Windows versions with CUDA 12 and 13, reflecting a commitment to supporting diverse hardware. This positions llama.cpp as a versatile inference runtime, catering to developers across different environments without introducing new model architectures.
The latest b9076 release of llama.cpp quietly broadens its platform support, making it more versatile for developers across various systems. Notably, it now exposes child model information through the router's /v1/models endpoint, giving users more visibility into what a router instance is serving. The update also covers macOS Apple Silicon with KleidiAI enabled, along with Ubuntu and Windows builds spanning Vulkan and ROCm 7.2. This release doesn't introduce new models but strengthens llama.cpp's position as a flexible inference runtime across diverse hardware configurations.
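A minimal client-side sketch of querying that endpoint, assuming the standard OpenAI-compatible response shape (`{"object": "list", "data": [{"id": ...}, ...]}`) and a default local server address; the release note does not show the exact fields the router reports for child models, so treat the URL and payload details as assumptions.

```python
import json
import urllib.request

def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str = "http://localhost:8080") -> list:
    # Query the server's OpenAI-compatible model listing; per b9076,
    # a router instance also includes its child models in this response.
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))
```

The parsing helper is separated from the HTTP call so the response handling can be exercised without a running server.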