Models & Labs

NVIDIA NeMo AutoModel Boosts Transformers Fine-Tuning

Hugging Face BlogJune 24, 2026high confidence

Why it matters

→NeMo AutoModel significantly reduces training time and memory usage for MoE models.
→It maintains API compatibility with Hugging Face, easing adoption for developers.
→The tool enhances scalability across multiple GPUs, crucial for large-scale AI projects.

NVIDIA NeMo AutoModel Boosts Transformers Fine-Tuning — ©Hugging Face Blog

NVIDIA has introduced NeMo AutoModel, an open library designed to enhance the fine-tuning of Transformers, especially for Mixture of Experts (MoE) models. This tool builds on Transformers v5, incorporating Expert Parallelism and DeepEP fused dispatch to deliver up to 3.7 times faster training and up to 32% less GPU memory usage. The library maintains compatibility with Hugging Face's from_pretrained() API, allowing users to benefit from these optimizations without changing their existing code. NeMo AutoModel is particularly effective in scaling MoE models across multiple GPUs, making it a valuable asset for developers working with large AI models.

Read original

More from Hugging Face Blog

Coding Toolscoding

Run vLLM Server on HF Jobs with One Command

Hugging Face has streamlined the process of deploying a vLLM server with a single command, making it easier for developers to test and evaluate models. By using the official vllm/vllm-openai image and specifying a GPU flavor, users can quickly set up a server for model inference. This approach allows for flexible scaling, accommodating larger models by adjusting GPU resources and parallel processing settings. The integration with Hugging Face's infrastructure simplifies access and management, providing a practical solution for developers needing quick, temporary model deployments.

Hugging Face BlogJun 26, 2026

Researchresearch

Hybrid Models Show Strength in Predicting Meaningful Tokens

Hugging Face's recent study reveals that hybrid language models have distinct advantages over traditional transformers in predicting tokens that carry meaning, such as nouns and verbs. The Olmo Hybrid model outperforms transformers in these areas, showcasing its ability to handle complex language structures. However, when it comes to repetitive tokens, transformers maintain an edge due to their efficient attention mechanisms. This research highlights the importance of evaluating models based on specific token types to uncover architectural strengths. These insights are expected to guide the development of more refined hybrid models, potentially enhancing language model capabilities in the future.

Hugging Face BlogJun 25, 2026

Researchresearch

Hugging Face Launches FFASR Leaderboard for ASR Models

Hugging Face and Treble Technologies have unveiled the FFASR Leaderboard, a pioneering benchmark for assessing automatic speech recognition (ASR) models in realistic far-field acoustic settings. This initiative tackles the discrepancy between traditional benchmarks and actual performance, where elements like reverberation and ambient noise significantly affect model accuracy. By offering a community-driven platform, the leaderboard promotes the creation of models that can withstand these challenging conditions. This development is poised to redirect focus towards enhancing real-world acoustic robustness, providing a more precise evaluation of ASR model performance in complex acoustic scenarios.

Hugging Face BlogJun 24, 2026

More in Models & Labs

Models & Labsmodels

Llama.cpp b9784 Release Enhances Hexagon Performance

The latest b9784 release of llama.cpp brings significant optimizations to Hexagon's matrix multiplication capabilities. By reworking the MUL_MAT and MUL_MAT_ID operations, the update introduces a 32x32 tiled weight repack and improved kernel parameters, enhancing performance and efficiency. These changes aim to optimize register usage and streamline activation processing, particularly benefiting users leveraging Hexagon's architecture. This release doesn't introduce new models but focuses on refining existing processes, making llama.cpp more robust for developers working with diverse hardware configurations.

llama.cpp ReleasesJun 26, 2026

Models & Labsmodels

llama.cpp b9788 release enhances dual-GPU support

The latest release of llama.cpp, b9788, introduces significant improvements for dual-GPU setups with SYCL support, particularly enhancing tensor parallelism. By implementing a degenerate ring all-reduce for dual-GPU configurations, the update optimizes performance for both small and large tensor operations, mirroring CUDA's NCCL allreduce pattern. This release notably boosts performance metrics, with Llama-3.3-70B and Qwen3-Coder-Next-80B-A3B models showing substantial speed improvements. The update positions llama.cpp as a more competitive option for multi-GPU environments, without adding new dependencies or altering build configurations.

llama.cpp ReleasesJun 26, 2026

Models & Labsmodels

OpenAI Develops Custom Chip 'Jalapeño'

OpenAI has announced the development of its first custom chip, named 'Jalapeño'.

The AI Daily BriefJun 25, 2026