Models & Labs

Together AI's Inference Engine Outperforms Competitors

Together AI Blog · May 19, 2026 · high confidence

Why it matters

→ Together AI reports 31% more tokens per second than the next fastest open-source engine on high-concurrency coding agent workloads.
→ The benchmark targets production conditions (many concurrent users, long input contexts), which makes it more relevant than traditional single-user tests.
→ Higher throughput per deployment could lower latency and operating costs for developers running coding agents.


Together AI has published benchmark results in which its Inference Engine delivers 31% more tokens per second than the next fastest open-source engine on coding agent workloads. The company attributes the gain to optimizations such as ThunderMLA and custom kernel rewrites. The benchmark simulates production scenarios with high concurrency and long input contexts rather than a single-user test. According to Together AI, this lets coding agents handle higher loads with lower latency and lower operating costs.


More from Together AI Blog

Models & Labs · agents

ThunderAgent Boosts Agentic Inference Efficiency

ThunderAgent introduces a novel approach to agentic inference, significantly improving throughput and reducing latency in synthetic data generation. By treating each agent workflow as a program rather than isolated requests, it mitigates KV cache thrashing and balances load across nodes. This results in up to 2.5× higher throughput on single nodes and near-linear scaling on multi-node clusters. ThunderAgent's compatibility with existing inference optimizations makes it a practical choice for enhancing large-scale agentic workloads.

Together AI Blog · Jul 29, 2026

Models & Labs · models

Together AI partners with Moonshot AI for Kimi models

Together AI is partnering with Moonshot AI to host Moonshot's Kimi models, including the 2.8-trillion-parameter Kimi K3, giving developers immediate access to these open-weight models. The collaboration also covers post-training, so developers can fine-tune the models for specific applications. Together AI positions the offering as a flexible, scalable open-model alternative to proprietary systems.

Together AI Blog · Jul 29, 2026

Models & Labs · models

Together AI Enhances Model Inference Configuration

Together AI has introduced a model inference architecture built around three objects (endpoints, deployments, and configurations) with capacity-aware traffic splitting between deployments. The design supports gradual rollouts, A/B testing, and zero-downtime updates. Configurations are immutable, and traffic is divided between deployments by weight, so a change ships as a new configuration and a rollback means shifting weight back to the previous one.

Together AI Blog · Jul 29, 2026
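
As a rough illustration of the weight-based, capacity-aware split described in the item above, here is a minimal Python sketch. It is not Together AI's API: the `Deployment` fields (`weight`, `capacity`, `in_flight`) and both function names are hypothetical, chosen only to show how saturated deployments can be dropped from the split and the remaining weights renormalized.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a config is replaced, never mutated (assumed design)
class Deployment:
    name: str
    weight: float   # operator-requested share of traffic
    capacity: int   # hypothetical: max concurrent requests
    in_flight: int  # hypothetical: requests currently being served


def effective_weights(deployments):
    """Drop deployments with no spare capacity, then renormalize the rest."""
    usable = {
        d.name: d.weight
        for d in deployments
        if d.weight > 0 and d.in_flight < d.capacity
    }
    total = sum(usable.values())
    if total == 0:
        raise RuntimeError("no deployment has spare capacity")
    return {name: w / total for name, w in usable.items()}


def pick_deployment(deployments, rng=random):
    """Route one request by sampling from the capacity-adjusted weights."""
    weights = effective_weights(deployments)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

With a 90/10 stable/canary split, a saturated stable deployment would send all new traffic to the canary until capacity frees up; a production system would more likely queue or scale out instead, which this sketch does not model.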

More in Models & Labs

Models & Labs · models

llama.cpp adds GLM-5.2 speculative decoding support

The latest llama.cpp update adds speculative decoding support for GLM-5.2 through the model's NextN/MTP (multi-token prediction) layers. The change covers tensor loading and context management for models using the GLM_DSA architecture. It also adds options for exporting models with or without the MTP layers, so developers can choose whether to ship the extra weights. The release mainly benefits those running GLM-5.2.

llama.cpp Releases · Jul 30, 2026
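
For readers unfamiliar with speculative decoding, the following Python sketch shows the general draft-and-verify idea in its greedy form. It is not llama.cpp code. The `target_next` and `draft_next` callables are hypothetical stand-ins: in an MTP-style setup the draft would typically come from extra prediction layers in the same model, and verification would be one batched forward pass rather than the sequential calls used here.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy draft-and-verify loop.

    target_next / draft_next: functions mapping a token list to the next token.
    The output always matches what target_next alone would produce greedily.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens ahead.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The target model checks each proposed token in order.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)      # the target's choice is always what is kept
            if tok != expected:
                break                 # first mismatch: discard the rest of the draft
        else:
            out.append(target_next(out))  # all accepted: one extra token for free

    return out[len(prompt):len(prompt) + n_new]
```

The speedup in real systems comes from step 2 costing roughly one target forward pass regardless of how many drafted tokens are accepted; the sketch preserves the acceptance logic but not that cost profile.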

Models & Labs · models

llama.cpp b10178 Release Adds Trace Logging

The b10178 release of llama.cpp enhances its server capabilities by adding trace logging for slot similarity checking, offering developers detailed insights into prompt cache slot selection processes. This update includes specifics on skip reasons and similarity calculations, which can aid in performance optimization. While no new model architectures are introduced, the release continues to support a wide array of platforms, such as macOS with KleidiAI, Ubuntu with ROCm 7.2, and Windows with CUDA 12 and 13. This makes llama.cpp a more versatile tool for developers working on different systems, reinforcing its position as a comprehensive inference runtime.

llama.cpp Releases · Jul 30, 2026
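
To make the slot similarity check in the item above concrete, here is a simplified Python sketch of prefix-based slot selection with a trace of skip reasons. It is an illustration under assumptions, not llama.cpp's implementation: the similarity formula (shared prefix length divided by prompt length), the `min_similarity` threshold, and the trace message wording are all hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_slot(slots, prompt, min_similarity=0.5):
    """Pick the slot whose cached tokens best match the new prompt.

    slots: dict mapping slot id -> list of cached tokens.
    Returns (best slot id or None, list of (slot id, message) trace entries).
    """
    if not prompt:
        raise ValueError("prompt must not be empty")
    best_id, best_sim = None, 0.0
    trace = []
    for slot_id, cached in slots.items():
        if not cached:
            trace.append((slot_id, "skip: empty cache"))
            continue
        sim = common_prefix_len(cached, prompt) / len(prompt)
        if sim < min_similarity:
            trace.append((slot_id, f"skip: similarity {sim:.2f} below threshold"))
            continue
        trace.append((slot_id, f"candidate: similarity {sim:.2f}"))
        if sim > best_sim:
            best_id, best_sim = slot_id, sim
    return best_id, trace
```

A trace like this is what makes cache misses debuggable: it shows, per slot, whether the slot was rejected outright or simply lost to a better match.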

Models & Labs · models

llama.cpp b10180 Release Enhances SYCL Performance

The b10180 release of llama.cpp brings notable improvements to SYCL performance, focusing on unary elementwise operations. By introducing a contiguous fast path and employing 32-bit index math, the update aims to boost computational efficiency. The integration of fastdiv for elementwise index math further enhances processing speed. Although there are no new models in this release, llama.cpp continues to evolve as a flexible inference runtime, now more efficient on systems like macOS, Linux, and Windows. Developers working with SYCL can expect smoother and faster operations, reinforcing llama.cpp's adaptability across different computing environments.

llama.cpp Releases · Jul 30, 2026
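
The "fastdiv" mentioned in the item above refers to replacing integer division by a fixed divisor with a multiply and a shift, which matters in index math because hardware integer division is slow on GPUs. The Python sketch below shows the standard multiply-and-shift construction for 32-bit unsigned values. It is an assumption that llama.cpp's helper follows the same scheme, and the function names here are illustrative only.

```python
def init_fastdiv(d):
    """Precompute (magic, shift) for dividing 32-bit unsigned values by d."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must be in [1, 2**32)")
    shift = (d - 1).bit_length()  # ceil(log2(d))
    magic = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return magic, shift


def fastdiv(n, magic, shift):
    """Compute n // d for 0 <= n < 2**32 without a division instruction."""
    hi = (n * magic) >> 32        # high 32 bits of the 64-bit product
    return (hi + n) >> shift


def unflatten(i, ncols, fd):
    """Turn a flat element index into (row, col) using the precomputed divisor."""
    row = fastdiv(i, *fd)
    col = i - row * ncols         # remainder via multiply-subtract, no modulo
    return row, col
```

The precomputation runs once per tensor shape on the host, so each element only pays for a multiply, an add, and a shift. Python integers do not overflow, so this sketch sidesteps the care a fixed-width implementation needs around the `hi + n` addition.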