The b9094 release of llama.cpp significantly broadens its platform compatibility, especially for macOS and Windows. It introduces KleidiAI-enabled builds for Apple Silicon, improving CPU performance for macOS users, and Windows builds now ship with both CUDA 12 and CUDA 13 support for GPU-accelerated workloads. While no new models are introduced, this update focuses on expanding accessibility across a wider range of hardware configurations, reinforcing llama.cpp's role as a versatile inference runtime.
The b9095 release of llama.cpp introduces an internal AllReduce kernel for CUDA, eliminating the need for NCCL in certain configurations. The single-phase CUDA kernel handles both data transfer and reduction across GPUs, targeting setups with two GPUs and FP32 tensors up to 256 KB. By providing an alternative to NCCL, the release gives developers working with tensor parallelism more flexibility and potentially fewer dependencies. The update also improves error logging and adds a watchdog feature to detect and address hangs, enhancing the robustness of the system.
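The release notes describe the kernel only at a high level, so the following is a minimal CUDA sketch of the single-phase idea: each of two GPUs launches one kernel that reads the peer's FP32 buffer directly over peer-to-peer access and writes the elementwise sum, so transfer and reduction happen in the same launch. The kernel name, buffer layout, and P2P setup here are illustrative assumptions, not llama.cpp's actual implementation.

```cuda
// Minimal sketch of a single-phase all-reduce between two GPUs via P2P access.
// Error checking is omitted for brevity; assumes both GPUs support peer access.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Each device computes out[i] = local[i] + peer[i] in a single launch,
// reading the other device's buffer directly instead of staging through
// host memory or a separate reduce/broadcast phase.
__global__ void allreduce_sum_f32(const float* local, const float* peer,
                                  float* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = local[i] + peer[i];
    }
}

int main() {
    // 256 KB of FP32 values, the size class mentioned in the release notes.
    const size_t n = 256 * 1024 / sizeof(float);
    const size_t bytes = n * sizeof(float);

    std::vector<float> h0(n, 1.0f), h1(n, 2.0f), result(n);
    float *in0, *in1, *out0, *out1;

    // Allocate a buffer on each device and enable P2P in both directions.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&in0, bytes);
    cudaMalloc(&out0, bytes);
    cudaMemcpy(in0, h0.data(), bytes, cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&in1, bytes);
    cudaMalloc(&out1, bytes);
    cudaMemcpy(in1, h1.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (int)((n + threads - 1) / threads);

    // One launch per device; inputs are read-only, so the two kernels can
    // run concurrently without racing on each other's buffers.
    cudaSetDevice(0);
    allreduce_sum_f32<<<blocks, threads>>>(in0, in1, out0, n);
    cudaSetDevice(1);
    allreduce_sum_f32<<<blocks, threads>>>(in1, in0, out1, n);

    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();

    // Both devices now hold identical reduced values.
    cudaSetDevice(0);
    cudaMemcpy(result.data(), out0, bytes, cudaMemcpyDeviceToHost);
    printf("out0[0] = %.1f (expected 3.0)\n", result[0]);

    cudaFree(in0); cudaFree(out0);
    cudaSetDevice(1);
    cudaFree(in1); cudaFree(out1);
    return 0;
}
```

Writing into separate output buffers keeps the inputs read-only, which is what lets both kernels run at the same time without a data race; a production kernel would also need to handle synchronization between the devices and fall back to NCCL or host staging when P2P is unavailable.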
The b9097 release of llama.cpp continues the trend of broadening platform compatibility, adding KleidiAI-enabled builds for macOS on Apple Silicon and various Linux configurations such as Ubuntu with Vulkan and ROCm 7.2. The update also enhances Windows support with CUDA 12 and 13 DLLs, making it easier for developers to work across different environments. While there are no groundbreaking new features, the release solidifies llama.cpp's position as a flexible inference runtime, and developers can leverage these updates to optimize performance across a wider range of hardware setups.
The v0.18.2rc0 release includes a fix for handling the max_pixels parameter in the PaddleOCR-VL image processor across transformations.