The b9100 release of llama.cpp focuses on backend sampling improvements, notably support for returning post-sampling probabilities, so that token probabilities reported after sampling are no longer returned as zero. The update also extends platform coverage across macOS, Linux, Windows, and Android, including builds for Vulkan and ROCm. This release is a step toward making llama.cpp a more robust and reliable tool for developers working with AI models.
The b9094 release of llama.cpp marks a significant expansion in platform support, particularly for macOS and Windows users. With the inclusion of KleidiAI-enabled builds for Apple Silicon, macOS users gain enhanced performance without additional configuration. Windows users benefit from the addition of CUDA 12 and 13 support, broadening the scope for GPU-accelerated tasks. This release doesn't introduce new models but focuses on making llama.cpp more accessible and versatile across a wider range of systems, reinforcing its position as a go-to inference runtime for diverse hardware setups.
The b9095 release of llama.cpp introduces an internal AllReduce kernel for CUDA, eliminating the need for NCCL in certain configurations. The new single-phase CUDA kernel handles both data transfer and reduction across GPUs, specifically targeting setups with two GPUs and FP32 tensors up to 256 KB. By providing an alternative to NCCL, this release offers more flexibility and potentially reduces dependencies for developers working with tensor parallelism. The update also includes improvements in error logging and a new watchdog feature to detect and address hangs, enhancing the robustness of the system.
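To make the single-phase idea concrete, here is a minimal, hypothetical sketch of a two-GPU FP32 all-reduce built on CUDA peer-to-peer access. The kernel name, buffer layout, and launch parameters are illustrative assumptions, not llama.cpp's actual kernel, and error handling and peer-access capability checks are omitted.

```cuda
// Hypothetical two-GPU all-reduce sketch (not the actual llama.cpp kernel):
// each GPU sums its own input with the peer's input via peer-to-peer reads,
// writing into a separate output buffer so the concurrent kernels do not race.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void allreduce_sum_f32(const float * local_in, const float * peer_in,
                                  float * out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = local_in[i] + peer_in[i];   // both GPUs end up with the same sum
    }
}

int main() {
    const int n = 256 * 1024 / sizeof(float);   // 256 KB of FP32, the stated limit
    float * in[2];
    float * out[2];

    // One input and one output buffer per GPU; enable peer access in both directions.
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&in[dev],  n * sizeof(float));
        cudaMalloc(&out[dev], n * sizeof(float));
        cudaMemset(in[dev], 0, n * sizeof(float));  // real inputs would be each GPU's partial results
        cudaDeviceEnablePeerAccess(dev ^ 1, 0);
    }

    // Single phase: each GPU reads the peer's input directly over PCIe/NVLink and
    // writes the reduced result locally, with no NCCL communicator and no staging
    // through host memory.
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        allreduce_sum_f32<<<(n + 255) / 256, 256>>>(in[dev], in[dev ^ 1], out[dev], n);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
    printf("all-reduce of %d floats complete\n", n);
    return 0;
}
```

The design point is the one described above: for small tensors, reading the peer's buffer directly and reducing in a single kernel launch avoids both a separate exchange phase and the NCCL dependency.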