Llama.cpp's b9095 release introduces an internal AllReduce kernel for CUDA, an NCCL-free path for tensor parallelism. The kernel currently covers two-GPU configurations and FP32 tensors up to 256 KB, so setups within those limits no longer need NCCL. The release also improves error logging and adds a watchdog that detects hangs, aiming to make tensor-parallel runs more reliable. Together, these changes reduce dependencies and simplify operation for developers running llama.cpp in tensor-parallel modes.
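For readers unfamiliar with what an NCCL-free all-reduce involves, the sketch below illustrates the general technique using CUDA peer-to-peer access between two GPUs: each device exposes its buffer to the other, one device reads the peer's data directly and accumulates, and the result is broadcast back. This is a hypothetical illustration of the approach only, not llama.cpp's actual kernel; the buffer names, fill values, and verification step are assumptions made for the example.

```cuda
// Minimal sketch of an NCCL-free all-reduce between two GPUs via CUDA
// peer-to-peer access. Illustrative only -- not llama.cpp's kernel.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Element-wise add of the peer GPU's buffer into the local buffer.
__global__ void add_peer(float *local, const float *peer, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local[i] += peer[i];
}

int main() {
    const int n = 256 * 1024 / sizeof(float); // 256 KB of FP32, matching the release's cap
    float *buf[2];

    // P2P must be supported in both directions for direct peer loads.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { fprintf(stderr, "P2P not supported\n"); return 1; }

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaDeviceEnablePeerAccess(1 - d, 0); // allow direct loads from the other GPU
        cudaMalloc(&buf[d], n * sizeof(float));
        std::vector<float> h(n, d ? 2.0f : 1.0f); // hypothetical per-rank partial results
        cudaMemcpy(buf[d], h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Reduce: device 0 reads device 1's buffer directly and accumulates.
    cudaSetDevice(0);
    add_peer<<<(n + 255) / 256, 256>>>(buf[0], buf[1], n);

    // Broadcast: copy the summed result back so both GPUs hold it.
    cudaMemcpyPeer(buf[1], 1, buf[0], 0, n * sizeof(float));
    cudaDeviceSynchronize();

    // Verify: 1.0 + 2.0 should appear on both devices.
    cudaSetDevice(1);
    float check;
    cudaMemcpy(&check, buf[1], sizeof(float), cudaMemcpyDeviceToHost);
    printf("buf[1][0] = %.1f (expected 3.0)\n", check);
    return 0;
}
```

For small tensors like these, a direct peer-read kernel avoids NCCL's library and initialization overhead entirely, which is presumably why the release caps the internal path at 256 KB and falls back to NCCL beyond it.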
The b9094 release of llama.cpp marks a significant expansion in platform support, particularly for macOS and Windows users. With the inclusion of KleidiAI-enabled builds for Apple Silicon, macOS users gain improved performance without additional configuration. Windows users benefit from the addition of CUDA 12 and 13 support, broadening the scope for GPU-accelerated tasks. This release doesn't introduce new models but focuses on making llama.cpp more accessible and versatile across a wider range of systems, reinforcing its position as a go-to inference runtime for diverse hardware setups.
The b9097 release of llama.cpp continues the trend of broadening platform compatibility, now covering KleidiAI-enabled builds for macOS on Apple Silicon and various Linux configurations, including Ubuntu with Vulkan and ROCm 7.2. The update also ships Windows CUDA 12 and 13 DLLs, making the runtime easier to deploy across different environments. While there are no groundbreaking new features, the release solidifies llama.cpp's position as a flexible inference runtime, and developers can use these builds to target a wider range of hardware setups.