The b9093 release of llama.cpp has been announced, featuring expanded support for multiple platforms and architectures. This update includes builds for macOS, Linux, Windows, and Android, catering to both CPU and GPU configurations. Key additions include ROCm 7.2 support for Ubuntu x64 and CUDA 12 and 13 support for Windows x64, enhancing compatibility with AMD and NVIDIA GPUs. While no new models are introduced, the release emphasizes broader accessibility and usability for developers across various systems.
The b9087 release of llama.cpp introduces significant improvements in SYCL support, adding reordered MMVQ (quantized matrix-vector multiplication) paths for the Q5_K and Q8_0 formats. This update, led by Intel's Chun Tao, aims to optimize performance in the SYCL backend, while the release continues to ship builds for macOS, Linux, and Windows. Although it brings no new models, it reinforces llama.cpp's position as a flexible tool for AI inference across a wide range of hardware configurations.
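As a rough illustration of what a "reordered" quantized layout means for MMVQ-style kernels, the sketch below converts an array-of-structs Q8_0 layout into a struct-of-arrays one. It is a simplified stand-in, not the actual llama.cpp SYCL code: the scale is stored as a float here rather than ggml's half type, and the function names are invented for this example.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr int QK8_0 = 32;

// Array-of-structs layout: each block interleaves one scale with 32 quants.
struct block_q8_0 {
    float  d;            // per-block scale (ggml actually stores fp16 here)
    int8_t qs[QK8_0];    // quantized values
};

// "Reordered" struct-of-arrays layout: all quants contiguous, then all scales,
// so GPU work-items in a matrix-vector kernel can read quants with coalesced
// accesses instead of striding over interleaved scales.
struct q8_0_reordered {
    std::vector<int8_t> qs;  // n_blocks * QK8_0 quants, contiguous
    std::vector<float>  d;   // n_blocks scales, contiguous
};

q8_0_reordered reorder_q8_0(const block_q8_0 *blocks, size_t n_blocks) {
    q8_0_reordered out;
    out.qs.resize(n_blocks * QK8_0);
    out.d.resize(n_blocks);
    for (size_t b = 0; b < n_blocks; ++b) {
        out.d[b] = blocks[b].d;
        for (int i = 0; i < QK8_0; ++i)
            out.qs[b * QK8_0 + i] = blocks[b].qs[i];
    }
    return out;
}
```

The same separation-of-scales idea applies to Q5_K, whose blocks carry more metadata per super-block; the payoff in both cases is a memory-access pattern that better suits the MMVQ kernels.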
The latest llama.cpp update tackles a performance bottleneck by adding BF16 support to the SYCL backend's GET_ROWS operation. This change eliminates the GPU-to-CPU tensor transfers previously required for models with BF16 embedding tensors, such as Gemma4's per_layer_token_embd.weight. By instantiating the existing get_rows_sycl_float template with sycl::ext::oneapi::bfloat16, the update mirrors the approach already used for the F16 and F32 data types. Alongside this SYCL improvement, the release continues to ship builds for macOS with KleidiAI, Ubuntu with ROCm 7.2, and Windows with CUDA 12 and 13. For developers running BF16 models, row lookups now stay on the device, making inference smoother and more efficient.
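To make the idea concrete, here is a minimal sketch of gathering embedding rows from a BF16 tensor with a single templated SYCL kernel, in the spirit of reusing one float-output template for F32, F16, and BF16 sources. The function and variable names are illustrative assumptions, not llama.cpp's actual get_rows_sycl_float implementation.

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/bfloat16.hpp>
#include <cstdint>
#include <cstdio>
#include <vector>

using bf16 = sycl::ext::oneapi::bfloat16;

// Templated row gather: dst row i = src row rows[i], converted to float.
// The same template works for float, sycl::half, and bf16 sources.
template <typename SrcT>
void get_rows_float(sycl::queue &q, const SrcT *src, const int32_t *rows,
                    float *dst, size_t n_rows_out, size_t row_size) {
    q.parallel_for(sycl::range<2>(n_rows_out, row_size), [=](sycl::id<2> idx) {
        const size_t r = idx[0], c = idx[1];
        dst[r * row_size + c] = static_cast<float>(src[rows[r] * row_size + c]);
    }).wait();
}

int main() {
    sycl::queue q;
    constexpr size_t n_embd = 4, n_vocab = 3;
    // Tiny BF16 "embedding table": 3 rows of 4 values.
    std::vector<bf16> embd;
    for (size_t i = 0; i < n_vocab * n_embd; ++i) embd.push_back(bf16(float(i)));
    std::vector<int32_t> ids = {2, 0};                 // gather rows 2 and 0
    std::vector<float> out(ids.size() * n_embd);

    bf16    *d_embd = sycl::malloc_device<bf16>(embd.size(), q);
    int32_t *d_ids  = sycl::malloc_device<int32_t>(ids.size(), q);
    float   *d_out  = sycl::malloc_device<float>(out.size(), q);
    q.copy(embd.data(), d_embd, embd.size()).wait();
    q.copy(ids.data(), d_ids, ids.size()).wait();

    get_rows_float(q, d_embd, d_ids, d_out, ids.size(), n_embd);

    q.copy(d_out, out.data(), out.size()).wait();
    for (float v : out) std::printf("%.1f ", v);       // expect: 8 9 10 11 0 1 2 3
    std::printf("\n");
    sycl::free(d_embd, q); sycl::free(d_ids, q); sycl::free(d_out, q);
}
```

Because the gather and the BF16-to-float conversion both happen on the device, no round trip through host memory is needed, which is the bottleneck the update removes.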
The v0.18.2rc0 release includes a fix for handling the max_pixels parameter in the PaddleOCR-VL image processor across transformations.