Together AI has built a highly efficient speech-to-text stack on NVIDIA's Parakeet-TDT and OpenAI's Whisper models. By optimizing the data path from CPU preprocessing to GPU execution, the team can transcribe 20 hours of audio in under 10 seconds. Key innovations include profile-aware TensorRT execution and GPU-side decoder control, both of which significantly improve performance. The result matters most for applications that demand low latency and high throughput in audio transcription.
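To put the headline figure in perspective, the short calculation below converts "20 hours in under 10 seconds" into a speedup over real time. It is only arithmetic on the two numbers quoted above; the function name is illustrative, and because the wall-clock time is stated as an upper bound, the result is a lower bound on the actual speedup.

```python
def speedup_over_real_time(audio_seconds: float, wall_seconds: float) -> float:
    """Return how many seconds of audio are processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Figures quoted in the summary: 20 hours of audio, under 10 seconds of compute.
audio_seconds = 20 * 3600   # 72,000 seconds of audio
wall_seconds = 10.0         # upper bound on transcription time

speedup = speedup_over_real_time(audio_seconds, wall_seconds)
print(f"At least {speedup:.0f}x faster than real time")
```

In other words, each second of compute covers at least two hours of audio.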
The vLLM v0.22.0 release marks a significant step forward in model performance and infrastructure. With 459 commits from 230 contributors, this update introduces major enhancements like the DeepSeek V4 model's reorganization and NVFP4 fused MoE support, which improve accuracy and efficiency. The Model Runner V2 now defaults to Qwen3 dense models, offering better performance with new features like sleep-mode weight reload. Additionally, the introduction of a Rust frontend and batch-invariant inference improvements highlight the release's focus on speed and flexibility. These updates collectively enhance the vLLM framework's capability to handle complex AI tasks more efficiently.
llama.cpp has fixed a critical issue in its device selection logic that affected systems using an integrated GPU (iGPU) as their main compute device. Previously, the presence of any RPC server caused the local iGPU to be ignored, leading to model loading failures. With this update, the iGPU is included whenever no other local GPU is available, even if RPC servers are configured, which allows proper tensor allocation and model loading on systems like the Strix Halo with significant unified memory. The fix improves the reliability of llama.cpp across diverse hardware configurations.
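The selection bug described above can be illustrated with a small model. This is a hedged sketch, not llama.cpp's actual source (which is C++): the `Device` type and both function names are hypothetical, and it assumes the iGPU is meant as a fallback when no discrete local GPU is present.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Device:
    name: str
    kind: str  # "rpc", "gpu" (discrete), or "igpu" (integrated)


def select_devices_before(rpc: List[Device], gpus: List[Device],
                          igpus: List[Device]) -> List[Device]:
    """Assumed old behavior: iGPUs are a fallback only if the list is empty."""
    devices = list(rpc) + list(gpus)
    # Problem: an RPC server alone makes the list non-empty,
    # so the local iGPU is never added.
    if not devices:
        devices.extend(igpus)
    return devices


def select_devices_after(rpc: List[Device], gpus: List[Device],
                         igpus: List[Device]) -> List[Device]:
    """Assumed fixed behavior: RPC servers do not suppress the iGPU fallback."""
    devices = list(rpc) + list(gpus)
    # Only the absence of discrete local GPUs decides whether iGPUs are used.
    if not gpus:
        devices.extend(igpus)
    return devices


rpc = [Device("rpc-server-0", "rpc")]
igpu = [Device("local-igpu", "igpu")]

before = select_devices_before(rpc, [], igpu)
after = select_devices_after(rpc, [], igpu)
print([d.name for d in before])
print([d.name for d in after])
```

In this model, a machine with one RPC server and only an iGPU ends up without any local device under the old logic, which matches the reported loading failure. Under the revised logic the iGPU is listed alongside the RPC server.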