
IBM has released two new multilingual embedding models on Hugging Face under the Apache 2.0 license. The Granite Embedding Multilingual R2 family includes a compact 97M-parameter model and a full-size 311M-parameter model, both supporting over 200 languages. The compact model achieves the highest retrieval score of any open multilingual model under 100M parameters, while the full-size model ranks second among models under 500M parameters. Both target broad language coverage and high retrieval quality across multilingual and code retrieval tasks.
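For readers new to embedding-based retrieval, the ranking step can be sketched roughly as follows. This is a minimal illustration only: the three-dimensional vectors are made-up placeholders, not output from the Granite models, and the names `cosine`, `rank`, `docs`, and `query` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values; a real model would produce
# vectors with hundreds of dimensions).
docs = {
    "doc_en": [0.9, 0.1, 0.0],
    "doc_de": [0.8, 0.2, 0.1],
    "doc_code": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = rank(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a multilingual setting, the point is that a query and a document in different languages can still land close together in the same vector space, so the same ranking step applies regardless of language.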
The latest llama.cpp release, b9145, fixes excessive system memory use by the SYCL backend on multi-GPU systems, particularly those using Intel Arc Pro GPUs. By replacing sycl::malloc_device with zeMemAllocDevice, the update cuts system RAM usage from 60 GiB to 6.7 GiB for a 15.6 GiB model, preventing out-of-memory crashes without sacrificing performance. The release also includes several other improvements and bug fixes for the SYCL backend.
The latest llama.cpp release adds a non-backtracking tokenizer handler for Qwen3.5. The handler fixes stack overflows that occurred when tokenizing long Unicode inputs. It adapts the earlier Qwen2 fix to Qwen3.5's regex requirements, including support for accent marks. The tokenizer change applies across operating systems and hardware configurations, including macOS with KleidiAI, Ubuntu with ROCm 7.2, and Windows with CUDA 12 and 13.
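To illustrate why a non-backtracking handler avoids stack overflows, here is a minimal sketch of a single-pass pre-tokenizer. It is not the llama.cpp implementation and does not reproduce the Qwen3.5 regex; the character classes and the names `char_class` and `pre_tokenize` are simplifying assumptions. The idea it shows is that a plain loop over the input uses constant stack depth, whereas a recursive backtracking regex engine can recurse once per character on long runs.

```python
import unicodedata

def char_class(ch):
    """Assign a character to a coarse class (simplified, assumed rules)."""
    category = unicodedata.category(ch)
    # Letters (L*) and combining marks (M*) stay together, so an accent
    # mark is kept attached to the word it belongs to.
    if category[0] in ("L", "M"):
        return "word"
    if category[0] == "N":
        return "number"
    if ch.isspace():
        return "space"
    return "other"

def pre_tokenize(text):
    """Split text into runs of same-class characters in one forward pass.

    No recursion and no backtracking: stack depth does not grow with
    the length of the input.
    """
    pieces = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or char_class(text[i]) != char_class(text[start]):
            pieces.append(text[start:i])
            start = i
    return pieces

# "e" followed by U+0301 (combining acute accent) stays inside the word.
print(pre_tokenize("he\u0301llo 42!"))

# A very long run is handled without deep recursion.
print(len(pre_tokenize("a" * 200_000)))
```

A hand-written scanner like this trades the flexibility of a regex for predictable resource use, which is the general trade-off behind non-backtracking tokenizer handlers.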