The b9087 release of llama.cpp focuses on enhancing SYCL support, adding reorder-optimized MMVQ paths for the Q5_K and Q8_0 quantization formats. This update, contributed by Intel's Chun Tao and Todd Malsbary, targets faster quantized matrix-vector multiplication on SYCL devices, and the release ships prebuilt binaries for macOS, Linux, and Windows. It does not introduce new models but strengthens llama.cpp's utility in AI inference by optimizing existing pathways, making it a more robust option for developers working with varied hardware setups.
The latest llama.cpp update tackles a performance bottleneck by adding BF16 support to the SYCL backend's GET_ROWS operation. The change eliminates GPU-to-CPU tensor transfers for models with BF16 embedding tensors, such as Gemma4's per_layer_token_embd.weight. By instantiating the existing get_rows_sycl_float template with sycl::ext::oneapi::bfloat16, it mirrors the approach already used for the F16 and F32 data types. The result is more efficient processing and better performance for developers running BF16 models on systems like macOS with KleidiAI, Ubuntu with ROCm 7.2, and Windows with CUDA 12 and 13.
The latest b9089 release of llama.cpp brings notable improvements in SYCL, specifically reducing allocation overhead during flash attention. This update refines the handling of memory allocation, which can enhance performance for developers using SYCL. Additionally, the release includes various platform-specific builds, such as macOS Apple Silicon and Windows with CUDA support, ensuring broad compatibility. While the update doesn't introduce new models, it strengthens llama.cpp's position as a versatile inference runtime across diverse hardware configurations.
The b9093 release of llama.cpp marks a significant step in broadening its platform compatibility, making it more accessible to a diverse range of users. With new builds for macOS, Linux, Windows, and Android, the update ensures that developers can leverage llama.cpp across various hardware configurations, including Apple Silicon, Intel, and ARM architectures. Notably, the addition of ROCm 7.2 for Ubuntu x64 and CUDA 12 and 13 for Windows x64 demonstrates a commitment to supporting both AMD and NVIDIA GPUs. This release doesn't introduce new models but focuses on making llama.cpp a versatile tool for developers working on different systems.
GitHub is moving forward with the deprecation of the Grok Code Fast 1 model across all Copilot experiences by May 15th. The change is driven by the discontinuation of the model by its provider, prompting users to adopt supported models. Administrators should update workflows and enable access to alternative models through Copilot settings to ensure seamless operation. The transition is designed to be smooth, as no manual removal of deprecated models is required. This step underscores GitHub's strategy to keep its AI tools current, ensuring users have access to the latest advancements. Enterprise customers are advised to reach out to their account managers with any concerns.
CyberSecQwen-4B is a new AI model designed specifically for defensive cybersecurity tasks, offering a balance between performance and deployability. It achieves nearly the same accuracy as larger models like Cisco's Foundation-Sec-Instruct-8B but with half the parameters, making it suitable for local deployment on consumer-grade GPUs. This model is particularly useful for tasks such as CWE classification and CTI Q&A, providing a practical solution for environments where data privacy and cost are critical. By focusing on narrow, well-defined tasks, CyberSecQwen-4B offers a specialized tool for cybersecurity professionals that can be run locally, addressing the unique challenges of the field.
Hugging Face has introduced EMO, a mixture-of-experts model designed for emergent modularity without predefined human biases. Unlike traditional models, which need the full set of parameters for optimal performance, EMO can reach near full-model performance using only 12.5% of its experts on specific tasks. This addresses the inefficiencies of large language models by enabling selective expert use, reducing computational cost while maintaining versatility. EMO's design encourages coherent expert grouping, making it a flexible and efficient tool for diverse applications.
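To make the "12.5% of experts" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind selective expert use in mixture-of-experts layers. The expert count, fraction, and function names below are illustrative assumptions, not EMO's actual architecture.

```python
# Hypothetical sketch: route each token to a small subset of experts.
# Sizes are illustrative, not taken from EMO.
import math
import random

random.seed(0)

NUM_EXPERTS = 16          # total experts in the layer (assumed)
ACTIVE_FRACTION = 0.125   # 12.5% -> 2 of 16 experts active per token

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(router_logits, k):
    """Pick the top-k experts by router score and renormalize their gates."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

k = int(NUM_EXPERTS * ACTIVE_FRACTION)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
for idx, gate in route(logits, k):
    print(f"expert {idx}: gate={gate:.3f}")
```

Because only k experts run per token, compute scales with k rather than with the total expert count, which is why a model can retain most of its quality while activating a small fraction of its parameters.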