Models & Labs

llama.cpp b9767 Release Enhances MTP Inference

llama.cpp Releases | June 24, 2026 | high confidence

Why it matters

→ Routes small-batch MTP inference through the mat-vec path, which improves decoding performance.
→ Adds a barrier to the NUM_COLS loop in mul-mat-vec; the effect on efficiency is not quantified.
→ Keeps builds available across macOS, Linux, and Windows, so llama.cpp stays usable on most developer platforms.

The b9767 release of llama.cpp improves multi-token prediction (MTP) inference by routing small batches through the mat-vec path, which speeds up decoding. It also adds a barrier to the NUM_COLS loop in the mul-mat-vec code; a barrier is a synchronization point, and the release summary does not quantify its effect on performance. Builds cover a wide range of platforms, including macOS, Linux, and Windows, and the release introduces no new model architectures. It continues the project's steady work on performance and compatibility.
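The idea of choosing a matrix-vector path for small batches can be sketched in a few lines. This is a minimal illustration, not llama.cpp's code: the function names, the pure-Python arithmetic, and the cutoff `MAT_VEC_MAX_COLS` are all assumptions made up for the example, and the real threshold and kernels are not stated in the release summary.

```python
# Hypothetical sketch: names and the cutoff are illustrative, not llama.cpp's.
MAT_VEC_MAX_COLS = 8  # assumed cutoff for what counts as a "small batch"


def mat_vec(weights, vec):
    """Multiply a (rows x k) matrix by a length-k vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def mat_mat(weights, cols):
    """Stand-in for a general matrix-matrix kernel.

    A real kernel would tile both operands; this placeholder only keeps
    the two code paths distinct and returns the same numbers.
    """
    return [mat_vec(weights, col) for col in cols]


def mul_mat(weights, cols):
    """Return (path_name, results), one result vector per input column."""
    if len(cols) <= MAT_VEC_MAX_COLS:
        # Small batch: handle each column with the mat-vec routine.
        return "mat_vec", [mat_vec(weights, col) for col in cols]
    # Large batch: fall back to the general matrix-matrix path.
    return "mat_mat", mat_mat(weights, cols)
```

Calling `mul_mat(weights, cols)` with a handful of columns takes the `mat_vec` branch, while a batch larger than the assumed cutoff takes the `mat_mat` branch. Both branches return the same values here; only the chosen path differs.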

More from llama.cpp Releases

Models & Labs | models

Granite Speech Plus Support Added in b9768 Release

The b9768 release of llama.cpp adds support for Granite Speech Plus, which uses multi-layer concatenation in its audio processing. The update also resolves naming inconsistencies and standardizes feature layer usage, which matters most to developers building audio applications. Beyond that, the release refines the existing audio features and makes them more reliable rather than adding new functionality.

llama.cpp Releases | Jun 24, 2026

Open Source | models

llama.cpp b9771 Release Trims Shader Variants

The b9771 release of llama.cpp makes 'mul_mm ALIGNED' a spec constant, which reduces the number of compiled shader variants and shrinks the binary. The change mainly benefits developers using Vulkan, since there are fewer variants to compile. The update adds no new features; builds continue to cover macOS, Linux, Windows, and openEuler, with hardware targets that include Apple Silicon, ROCm, and CUDA.

llama.cpp Releases | Jun 24, 2026

Open Source | models

llama.cpp b9773 Release Expands Platform Support

The b9773 release of llama.cpp broadens platform coverage without adding major features. It includes support for ROCm 7.2 on Ubuntu x64, which matters for AMD GPU users looking for an alternative to NVIDIA's CUDA. Builds also remain available across macOS, Linux, Windows, and openEuler, so developers can deploy llama.cpp in many different computing environments.

llama.cpp Releases | Jun 24, 2026

More in Models & Labs

Models & Labs | agents

NVIDIA Launches Agent Toolkit for Specialized AI

NVIDIA's new Agent Toolkit targets specialized AI agents that enterprises can customize and trust. It provides a modular foundation of models, tools, and a secure runtime, so businesses can build AI systems tailored to their specific workflows. The toolkit is particularly relevant to life sciences and healthcare, where AI agents can sharply reduce the time needed for complex tasks such as protein design and clinical documentation. Because the toolkit is open, companies can integrate these agents into existing systems while keeping control over them.

NVIDIA Blog | Jun 23, 2026

Models & Labs | models

Sakana AI launches Fugu for model orchestration

Sakana AI's Fugu coordinates multiple models behind a single API, addressing challenges such as those posed by export controls on Anthropic's models. Fugu is available in two versions: a faster model for everyday tasks and a more robust version for complex applications such as patent research. Sakana asserts that Fugu performs comparably to leading models, but initial feedback suggests it may not yet reach that level. The launch reflects a shift towards model orchestration, though questions about cost and transparency remain unresolved.

The Rundown AI | Jun 23, 2026

Models & Labs | coding

Cross-Origin Storage API in Transformers.js

The proposed Cross-Origin Storage API would let web apps identify large files by cryptographic hash instead of by URL, so the same file can be recognized across different origins. Browsers currently isolate caches by origin, which means each app downloads and stores its own copy of shared resources such as AI models and Wasm files. Recognizing identical files across apps could significantly reduce bandwidth and storage usage. The API is still in early stages and not natively supported by browsers, but developers can experiment with it using a polyfill extension.

Hugging Face Blog | Jun 23, 2026
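The hash-based identification described in the Cross-Origin Storage item can be illustrated with a small content-addressed store. This is a conceptual sketch in Python, not the proposed browser API: the class `HashKeyedStore`, its `get` method, and the `fetch` callback are hypothetical names invented for the example, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class HashKeyedStore:
    """Hypothetical shared store that keys blobs by SHA-256, not by URL."""

    def __init__(self):
        self._blobs = {}
        self.downloads = 0  # counts how often the fetch callback ran

    def get(self, sha256_hex, fetch):
        """Return the blob for this hash, fetching it only if missing."""
        if sha256_hex not in self._blobs:
            data = fetch()
            self.downloads += 1
            # Verify the content really matches the hash that was requested.
            if hashlib.sha256(data).hexdigest() != sha256_hex:
                raise ValueError("content does not match requested hash")
            self._blobs[sha256_hex] = data
        return self._blobs[sha256_hex]
```

Two apps that request the same digest, even if each would have fetched the file from a different URL, trigger a single download in this sketch, because the lookup key is the content hash rather than the origin or address.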