llama.cpp has released an update that includes a new non-backtracking tokenizer handler for Qwen3.5. The change addresses stack overflow issues that could arise when backtracking regex matching processed long Unicode inputs. The update mirrors a previous fix for Qwen2 but is tailored to Qwen3.5's regex requirements. This is significant for developers working with complex text inputs, ensuring more reliable and efficient tokenization across multiple platforms.
The latest llama.cpp release, b9145, tackles a significant issue with SYCL's memory allocation on multi-GPU systems, particularly those using Intel Arc Pro GPUs. By replacing sycl::malloc_device with zeMemAllocDevice, the update drastically reduces system RAM usage from 60 GiB to just 6.7 GiB for a 15.6 GiB model, preventing out-of-memory crashes without sacrificing performance. This change is crucial for developers working with large models on multi-GPU setups, as it ensures more efficient memory management. The update also includes several improvements and bug fixes, enhancing the robustness of the SYCL backend.
The b9150 release of llama.cpp continues its trend of broadening platform compatibility, adding support for macOS Apple Silicon with KleidiAI enabled and a variety of Linux configurations, such as Ubuntu with ROCm 7.2 and Vulkan. The release also enhances Windows support with CUDA 12 and 13 DLLs, making it more versatile for developers working across different environments. While there are no groundbreaking new features, the release solidifies llama.cpp's position as a flexible inference runtime for diverse hardware setups, and developers can leverage these updates to optimize performance across a wider range of systems.