vLLM has announced the release of version 0.19.0, comprising 448 commits from 197 contributors, 54 of them first-time contributors. Key features include full support for the Google Gemma 4 architecture, zero-bubble async scheduling combined with speculative decoding, and enhancements to Model Runner V2. The update also adds compatibility with HuggingFace Transformers v5 and several NVIDIA-specific optimizations, with the overall aim of improving throughput and performance across a range of models and applications.
The latest build of llama.cpp, b8991, has been released, featuring updates for various operating systems.
The latest update to llama-mmap improves compatibility across platforms and model sizes. Key enhancements include support for 32-bit wasm builds and style updates to gguf.cpp.

The v0.19.0rc0 pre-release introduces CPU key-value cache offloading, which frees GPU memory by spilling KV blocks to host RAM instead of discarding them. The change was signed off by Yifan Qiao.
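To illustrate the general idea behind CPU key-value cache offloading, here is a minimal, self-contained sketch; this is purely conceptual and not the project's actual implementation. The assumption is a fixed budget of KV blocks resident in GPU memory, with least-recently-used blocks offloaded to a larger CPU-side store and fetched back on access:

```python
# Conceptual sketch of CPU KV-cache offloading (illustrative only, not the
# actual vLLM implementation). "GPU" and "CPU" here are plain dicts standing
# in for device and host memory pools.
from collections import OrderedDict

class KVCacheOffloader:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity   # max KV blocks resident on "GPU"
        self.gpu = OrderedDict()           # block_id -> block, in LRU order
        self.cpu = {}                      # overflow store in host memory

    def put(self, block_id, block):
        """Insert a KV block, offloading the LRU block to CPU if over budget."""
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)     # mark as most recently used
        while len(self.gpu) > self.gpu_capacity:
            evicted_id, evicted = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted  # offload instead of discarding

    def get(self, block_id):
        """Fetch a block, promoting it back to GPU if it was offloaded."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        block = self.cpu.pop(block_id)      # KeyError if the block never existed
        self.put(block_id, block)           # bring it back onto the GPU
        return block
```

The payoff over plain eviction is that a reused block (for example, a shared prompt prefix) is recovered with a host-to-device copy rather than recomputed from scratch.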