The b9159 release of llama.cpp expands platform support across macOS, Linux, Windows, and Android. Notable additions include compatibility with Apple Silicon, Vulkan, ROCm 7.2, and CUDA 13, covering a wide range of hardware setups. The update makes the tool easier to adopt for developers running AI inference across different systems. Although no new model architectures are introduced, the release underscores llama.cpp's commitment to broad platform coverage.
The latest llama.cpp release, b9145, tackles a significant memory-allocation issue in the SYCL backend on multi-GPU systems, particularly those using Intel Arc Pro GPUs. By replacing sycl::malloc_device with zeMemAllocDevice, the update cuts system RAM usage from 60 GiB to just 6.7 GiB for a 15.6 GiB model, preventing out-of-memory crashes without sacrificing performance. This matters for developers running large models on multi-GPU setups, since it yields far more efficient memory management. The release also bundles several improvements and bug fixes that strengthen the SYCL backend.
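The patch itself isn't reproduced in the release notes, so the following is only a rough sketch of the idea: allocating device memory directly through the Level Zero API via the oneAPI DPC++ interop, instead of going through sycl::malloc_device. The helper name `alloc_device_l0` and the surrounding plumbing are assumptions for illustration, not llama.cpp's actual code.

```cpp
// Minimal sketch (not the actual llama.cpp patch): allocate device memory
// with zeMemAllocDevice instead of sycl::malloc_device, using the DPC++
// Level Zero backend interop to obtain native handles from a SYCL queue.
#include <cstddef>
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/backend/level_zero.hpp>  // oneAPI Level Zero interop
#include <level_zero/ze_api.h>

void* alloc_device_l0(sycl::queue& q, size_t size) {
    // Native Level Zero context/device behind the SYCL objects.
    auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
    auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());

    ze_device_mem_alloc_desc_t desc{};
    desc.stype   = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;
    desc.ordinal = 0;  // default device memory ordinal

    void* ptr = nullptr;
    // Per the release note, the direct Level Zero allocation avoids the
    // system-RAM blowup seen with sycl::malloc_device on multi-GPU setups.
    if (zeMemAllocDevice(ze_ctx, &desc, size, /*alignment=*/64, ze_dev, &ptr)
            != ZE_RESULT_SUCCESS) {
        return nullptr;
    }
    return ptr;  // release later with zeMemFree(ze_ctx, ptr)
}
```

Error handling and allocator bookkeeping are elided here; the point is only that the allocation bypasses the SYCL USM layer and talks to Level Zero directly.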
The latest llama.cpp release adds a non-backtracking tokenizer handler designed for Qwen3.5. The change improves Unicode tokenization and fixes stack overflows triggered by long inputs. By adapting the earlier Qwen2 fix to Qwen3.5's regex requirements, including support for accent marks, the update makes text processing more reliable. Developers can expect stabler handling of complex Unicode inputs across operating systems and hardware configurations, including macOS with KleidiAI, Ubuntu with ROCm 7.2, and Windows with CUDA 12 and 13.
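The handler itself isn't shown in the release notes; the sketch below only illustrates the general shape of a non-backtracking pretokenizer pass. Instead of a recursive regex engine, which can overflow the stack on very long inputs, it walks the codepoints once, left to right, and keeps a letter together with its trailing combining marks (accents). The names `cp_class`, `classify`, and `split_no_backtrack` are illustrative, and the classifier is a crude stand-in for llama.cpp's real Unicode category tables.

```cpp
// Illustrative single-pass (non-backtracking) pretokenizer sketch.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

enum class cp_class { letter, combining_mark, digit, space, other };

// Crude stand-in for real Unicode tables: just enough for this demo.
cp_class classify(uint32_t cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return cp_class::combining_mark;  // accents
    if (cp == ' ' || cp == '\t' || cp == '\n') return cp_class::space;
    if (cp >= '0' && cp <= '9') return cp_class::digit;
    if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || cp >= 0x00C0)
        return cp_class::letter;
    return cp_class::other;
}

// One left-to-right scan over the codepoints: no recursion, no backtracking,
// so stack usage stays constant regardless of input length.
std::vector<std::pair<size_t, size_t>> split_no_backtrack(const std::vector<uint32_t>& cps) {
    std::vector<std::pair<size_t, size_t>> spans;
    size_t i = 0;
    while (i < cps.size()) {
        size_t start = i;
        switch (classify(cps[i])) {
            case cp_class::letter:
                // Consume the whole word, including trailing combining marks,
                // so decomposed "é" ('e' + U+0301) stays in one span.
                ++i;
                while (i < cps.size() && (classify(cps[i]) == cp_class::letter ||
                                          classify(cps[i]) == cp_class::combining_mark)) ++i;
                break;
            case cp_class::digit:
                ++i;
                while (i < cps.size() && classify(cps[i]) == cp_class::digit) ++i;
                break;
            default:
                ++i;  // spaces, punctuation, etc. become single-codepoint spans
                break;
        }
        spans.emplace_back(start, i);
    }
    return spans;
}

int main() {
    // "cafe" + combining acute accent, a space, then "42".
    std::vector<uint32_t> cps = {'c', 'a', 'f', 'e', 0x0301, ' ', '4', '2'};
    for (auto [s, e] : split_no_backtrack(cps))
        std::printf("span [%zu, %zu)\n", s, e);  // [0,5) [5,6) [6,8)
}
```

Because each codepoint is visited a bounded number of times, the pass runs in linear time and avoids the deep recursion that caused the reported stack overflows.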