Models & Labs

llama.cpp b9019 Release Enhances Model Flexibility

llama.cpp Releases · May 5, 2026 · high confidence

Why it matters

→Moving load_hparams and load_tensors into per-model definitions lets each model handle its own loading, giving developers more flexibility.
→The update enhances the system's adaptability to various hardware configurations.
→A more modular codebase should make future changes easier to implement.

The b9019 release of llama.cpp focuses on the modularity and flexibility of the codebase. Key functions such as load_hparams and load_tensors have moved to per-model definitions, so each model can be handled in a more tailored way. The update also adds build_graph and improves the switch-case logic, making the system more adaptable to different hardware setups. The release covers a wide range of platforms, including macOS, Linux, and Windows, but introduces no new model architectures.
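
To illustrate what "per-model definitions" means in practice, here is a minimal Python sketch of the general pattern: each architecture supplies its own load_hparams, load_tensors and build_graph, and a registry lookup replaces one central switch over architectures. This is not llama.cpp's C++ code. Only those three method names come from the release note; ModelDef, DenseDef, MoEDef, MODEL_DEFS and load_model are hypothetical names invented for illustration.

```python
from abc import ABC, abstractmethod


class ModelDef(ABC):
    """Hypothetical per-model definition: one subclass per architecture."""

    @abstractmethod
    def load_hparams(self, meta: dict) -> dict:
        """Read this architecture's hyperparameters from model metadata."""

    @abstractmethod
    def load_tensors(self, tensors: dict) -> dict:
        """Keep only the weight tensors this architecture uses."""

    @abstractmethod
    def build_graph(self, hparams: dict) -> list:
        """Return the compute graph as an ordered list of op names."""


class DenseDef(ModelDef):
    """Hypothetical dense transformer: one attention block per layer."""

    def load_hparams(self, meta):
        return {"n_layer": meta["n_layer"]}

    def load_tensors(self, tensors):
        return {k: v for k, v in tensors.items() if k.startswith("blk.")}

    def build_graph(self, hparams):
        return [f"attn_{i}" for i in range(hparams["n_layer"])] + ["output"]


class MoEDef(ModelDef):
    """Hypothetical mixture-of-experts model: attention plus an expert block per layer."""

    def load_hparams(self, meta):
        return {"n_layer": meta["n_layer"], "n_expert": meta["n_expert"]}

    def load_tensors(self, tensors):
        return {k: v for k, v in tensors.items() if k.startswith(("blk.", "exp."))}

    def build_graph(self, hparams):
        ops = []
        for i in range(hparams["n_layer"]):
            ops += [f"attn_{i}", f"moe_{i}_x{hparams['n_expert']}"]
        return ops + ["output"]


# Adding an architecture means adding one class and one registry entry,
# instead of extending a shared switch statement in every loader function.
MODEL_DEFS = {"dense": DenseDef, "moe": MoEDef}


def load_model(arch: str, meta: dict, tensors: dict):
    model_def = MODEL_DEFS[arch]()  # lookup stands in for switch (arch) { ... }
    hparams = model_def.load_hparams(meta)
    weights = model_def.load_tensors(tensors)
    graph = model_def.build_graph(hparams)
    return hparams, weights, graph
```

The design trade-off is locality: everything about one architecture lives in one place, at the cost of a small dispatch layer.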

More from llama.cpp Releases

Open Source · models

llama.cpp b9015 Release Expands Platform Support

The b9015 release of llama.cpp extends its platform coverage to macOS on Apple Silicon with KleidiAI enabled and to Ubuntu with ROCm 7.2. It also brings Vulkan support to both Linux and Windows, and Windows users get builds for CUDA 12 and CUDA 13, covering current NVIDIA toolchains. The release introduces no new model architectures; instead it strengthens llama.cpp's role as a flexible inference runtime for developers working with varied hardware.
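
For readers who compile llama.cpp themselves, each backend mentioned above is switched on with a CMake option. The sketch below maps backends to the option names given in llama.cpp's build guide (GGML_CUDA, GGML_VULKAN, GGML_HIP for ROCm, GGML_CPU_KLEIDIAI); those names should be checked against docs/build.md for the version you check out. The backend labels ("cuda", "rocm", and so on) and both helper functions are hypothetical conveniences, not part of llama.cpp.

```python
# CMake options per backend, following llama.cpp's build guide (docs/build.md).
# The dictionary keys are hypothetical labels chosen for this sketch.
BACKEND_FLAGS = {
    "cpu": [],                               # default CPU build
    "kleidiai": ["-DGGML_CPU_KLEIDIAI=ON"],  # Arm KleidiAI CPU kernels
    "cuda": ["-DGGML_CUDA=ON"],              # NVIDIA GPUs, needs the CUDA toolkit
    "rocm": ["-DGGML_HIP=ON"],               # AMD GPUs through ROCm/HIP
    "vulkan": ["-DGGML_VULKAN=ON"],          # Vulkan-capable GPUs, needs the Vulkan SDK
}


def configure_command(backend: str, build_dir: str = "build") -> list:
    """Return the cmake configure command line for one backend."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    return ["cmake", "-B", build_dir, *BACKEND_FLAGS[backend]]


def build_command(build_dir: str = "build") -> list:
    """Return the compile step, which is the same for every backend."""
    return ["cmake", "--build", build_dir, "--config", "Release"]
```

From inside a llama.cpp checkout, the two commands can be passed to subprocess.run(..., check=True) in order. Backend-specific extras, such as the target GPU architecture for ROCm builds, still have to be added by hand.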

llama.cpp Releases · May 5, 2026

Models & Labs · models

llama.cpp b9018 release expands platform support

The b9018 release of llama.cpp continues to broaden platform compatibility, with support for macOS, Linux, Windows, and Android. Notably, it introduces Vulkan support on Ubuntu and Windows and adds ROCm 7.2 for AMD GPUs, a significant option for users seeking alternatives to NVIDIA's CUDA. The release brings no new models or quantization methods, but it solidifies llama.cpp's position as a versatile inference runtime and lets users choose the backend that best suits their hardware.

llama.cpp Releases · May 5, 2026

Models & Labs · models

llama.cpp b9025 Release Expands Platform Support

The b9025 release of llama.cpp continues to broaden platform compatibility, with support for macOS, Linux, Windows, and Android. Notably, it introduces Vulkan support on Ubuntu and Windows and adds ROCm 7.2 on Ubuntu, widening the GPU options available. The release adds no new models; its focus is keeping llama.cpp usable across different hardware, so developers can rely on it regardless of their platform.

llama.cpp Releases · May 5, 2026

More in Models & Labs

Models & Labs · models

Google unveils major AI advancements at Cloud Next '26

Google's Cloud Next '26 event centered on the "agentic era", with the launch of the Gemini Enterprise Agent Platform and eighth-generation TPUs. These are aimed at improving business operations and data-center energy efficiency. Google also introduced Gemma 4, an open model for advanced reasoning, and Deep Research Max, which automates high-level research tasks. In addition, Google Vids now offers free video generation, widening access to professional-quality video creation. Together, the announcements show Google integrating AI across sectors from education to enterprise.

Google AI Blog · May 4, 2026

Models & Labs · agents

Gemini API Introduces Webhooks for Long-Running Jobs

Google's Gemini API now supports event-driven Webhooks, reducing friction and latency for long-running tasks. Instead of polling continuously, developers receive a notification as soon as a job completes. The implementation follows the Standard Webhooks specification, which provides signed requests and automatic retries for secure, reliable delivery. This makes complex workflows such as deep research or batch processing easier to manage.
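
Because the feature follows the Standard Webhooks specification, a receiver can verify deliveries with the scheme that spec defines: an HMAC-SHA256 over "id.timestamp.body", sent base64-encoded as "v1,<signature>" in the webhook-signature header, alongside webhook-id and webhook-timestamp. The sketch below assumes Gemini's deliveries use exactly those spec-defined headers and a "whsec_"-prefixed base64 secret, and that the headers dict has lowercase keys. How Gemini issues the secret and registers endpoints is not covered here, and verify_webhook is a hypothetical helper, not a Gemini SDK function.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(secret: str, headers: dict, body: bytes,
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Check a Standard Webhooks signature; header keys assumed lowercase."""
    msg_id = headers["webhook-id"]
    timestamp = headers["webhook-timestamp"]
    signatures = headers["webhook-signature"]

    # Reject stale or future-dated deliveries to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False

    # The secret is base64, optionally carrying a "whsec_" prefix.
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{msg_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()

    # The header may list several space-separated "v1,<sig>" entries (key rotation).
    for entry in signatures.split():
        version, _, sig = entry.partition(",")
        if version == "v1" and hmac.compare_digest(sig, expected):
            return True
    return False
```

The body must be the raw request bytes, not re-serialized JSON, or the signature will not match. hmac.compare_digest keeps the comparison constant-time.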

Google AI Blog · May 4, 2026

Models & Labs · models

vLLM v0.20.2rc0 introduces shutdown() method

The vLLM release candidate v0.20.2rc0 adds a shutdown() method, giving developers more control over the lifecycle of their applications. The addition is practical for anyone who needs to release resources and exit cleanly from AI systems. Although the change is small, it reflects a focus on robustness and reliability in AI infrastructure and should reduce problems during shutdown.
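
One common way to guarantee a clean exit is to tie shutdown() to a with block, so it runs even if the body raises. The release note does not say which vLLM class exposes the method, so the sketch below is generic: it works with any object that has a shutdown() method, and managed is a hypothetical helper. The standard library's contextlib.closing() calls close() rather than shutdown(), which is why a small custom context manager is used.

```python
from contextlib import contextmanager


@contextmanager
def managed(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and when the with-body raises
```

If shutdown() is available on vLLM's LLM object in your version, usage would look like with managed(LLM(model=...)) as llm: followed by the generate calls; check the release notes to confirm where the method lives.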

vLLM Releases · May 4, 2026