The b9145 release of llama.cpp fixes memory allocation issues in the SYCL backend on multi-GPU systems. By switching from sycl::malloc_device to Level Zero's zeMemAllocDevice, the update significantly reduces system RAM usage, preventing out-of-memory crashes on setups such as a dual Intel Arc Pro B70 machine, without sacrificing performance. The release also bundles assorted improvements and bug fixes that further stabilize the SYCL backend.
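As a rough illustration of what such a switch looks like, here is a minimal sketch of allocating device memory through the Level Zero API from a SYCL queue instead of calling sycl::malloc_device. This is not the actual llama.cpp patch: it assumes a DPC++ toolchain with Level Zero interop headers available, and the alloc_device_l0 helper name is hypothetical.

```cpp
// Illustrative sketch, not the actual llama.cpp change: allocate device
// memory directly through Level Zero rather than via sycl::malloc_device.
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/backend/level_zero.hpp>
#include <level_zero/ze_api.h>
#include <cstdio>

void *alloc_device_l0(sycl::queue &q, size_t size) {
    // Unwrap the native Level Zero handles behind the SYCL queue.
    auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
    auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());

    ze_device_mem_alloc_desc_t desc{};
    desc.stype   = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;
    desc.ordinal = 0; // default device memory ordinal

    void *ptr = nullptr;
    // Device-only allocation; per the release notes, routing the allocation
    // through Level Zero is what avoids the system-RAM growth seen with the
    // previous sycl::malloc_device path on multi-GPU machines.
    ze_result_t res = zeMemAllocDevice(ze_ctx, &desc, size, /*alignment=*/64, ze_dev, &ptr);
    if (res != ZE_RESULT_SUCCESS) {
        fprintf(stderr, "zeMemAllocDevice failed: 0x%x\n", res);
        return nullptr;
    }
    // The matching free is zeMemFree(ze_ctx, ptr).
    return ptr;
}
```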
Llama.cpp's latest release adds a non-backtracking tokenizer handler designed for Qwen3.5. The update significantly improves Unicode tokenization, fixing stack overflow crashes that occurred on long inputs. By adapting the earlier Qwen2 fix to Qwen3.5's regex requirements, including support for accent marks, it makes text processing more reliable. Developers can expect more stable behavior on complex Unicode inputs, with robust tokenization across operating systems and hardware configurations, including macOS with KleidiAI, Ubuntu with ROCm 7.2, and Windows with CUDA 12 and 13.
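To see why a non-backtracking handler avoids the overflow, consider this sketch of a single-pass pre-tokenizer loop. It is an illustrative approximation, not llama.cpp's actual implementation: the real handler works over full Unicode categories and the Qwen3.5 regex, while the split_no_backtrack and is_combining_mark names here are hypothetical and the character classes deliberately simplified.

```cpp
// Illustrative sketch of a non-backtracking pre-tokenizer split.
#include <string>
#include <vector>
#include <cctype>

// Hypothetical helper: treat U+0300..U+036F (combining accents) as part of
// the preceding word, mirroring the accent-mark support described above.
static bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

std::vector<std::u32string> split_no_backtrack(const std::u32string &text) {
    std::vector<std::u32string> out;
    size_t i = 0, n = text.size();
    while (i < n) {
        size_t start = i;
        char32_t c = text[i];
        if (c < 0x80 && std::isalpha((int)c)) {
            // Word run: ASCII letters plus trailing combining marks.
            while (i < n && ((text[i] < 0x80 && std::isalpha((int)text[i]))
                             || is_combining_mark(text[i]))) i++;
        } else if (c < 0x80 && std::isdigit((int)c)) {
            while (i < n && text[i] < 0x80 && std::isdigit((int)text[i])) i++;
        } else if (c < 0x80 && std::isspace((int)c)) {
            while (i < n && text[i] < 0x80 && std::isspace((int)text[i])) i++;
        } else {
            i++; // any other codepoint becomes its own piece
        }
        out.emplace_back(text.substr(start, i - start));
    }
    return out;
}
```

Because every position is consumed exactly once and nothing recurses, stack depth stays constant and running time stays linear regardless of input length, which is the property that a backtracking regex engine loses on pathological long inputs.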
The b9150 release of llama.cpp continues its trend of broadening platform compatibility, now including support for macOS Apple Silicon builds with KleidiAI enabled and a variety of Linux configurations such as Ubuntu with ROCm 7.2 and Vulkan. Windows support is also extended with CUDA 12 and 13 DLLs. While there are no groundbreaking new features, the release solidifies llama.cpp's position as a flexible inference runtime for diverse hardware setups, letting developers optimize performance across a wider range of systems.