
NVIDIA has introduced NeMo AutoModel, an open library designed to enhance the fine-tuning of Transformers, especially for Mixture of Experts (MoE) models. This tool builds on Transformers v5, incorporating Expert Parallelism and DeepEP fused dispatch to deliver up to 3.7 times faster training and up to 32% less GPU memory usage. The library maintains compatibility with Hugging Face's from_pretrained() API, allowing users to benefit from these optimizations without changing their existing code. NeMo AutoModel is particularly effective in scaling MoE models across multiple GPUs, making it a valuable asset for developers working with large AI models.
Read originalHugging Face has streamlined the process of deploying a vLLM server with a single command, making it easier for developers to test and evaluate models. By using the official vllm/vllm-openai image and specifying a GPU flavor, users can quickly set up a server for model inference. This approach allows for flexible scaling, accommodating larger models by adjusting GPU resources and parallel processing settings. The integration with Hugging Face's infrastructure simplifies access and management, providing a practical solution for developers needing quick, temporary model deployments.
© Hugging Face BlogHugging Face's recent study reveals that hybrid language models have distinct advantages over traditional transformers in predicting tokens that carry meaning, such as nouns and verbs. The Olmo Hybrid model outperforms transformers in these areas, showcasing its ability to handle complex language structures. However, when it comes to repetitive tokens, transformers maintain an edge due to their efficient attention mechanisms. This research highlights the importance of evaluating models based on specific token types to uncover architectural strengths. These insights are expected to guide the development of more refined hybrid models, potentially enhancing language model capabilities in the future.
Hugging Face and Treble Technologies have unveiled the FFASR Leaderboard, a pioneering benchmark for assessing automatic speech recognition (ASR) models in realistic far-field acoustic settings. This initiative tackles the discrepancy between traditional benchmarks and actual performance, where elements like reverberation and ambient noise significantly affect model accuracy. By offering a community-driven platform, the leaderboard promotes the creation of models that can withstand these challenging conditions. This development is poised to redirect focus towards enhancing real-world acoustic robustness, providing a more precise evaluation of ASR model performance in complex acoustic scenarios.
The latest b9784 release of llama.cpp brings significant optimizations to Hexagon's matrix multiplication capabilities. By reworking the MUL_MAT and MUL_MAT_ID operations, the update introduces a 32x32 tiled weight repack and improved kernel parameters, enhancing performance and efficiency. These changes aim to optimize register usage and streamline activation processing, particularly benefiting users leveraging Hexagon's architecture. This release doesn't introduce new models but focuses on refining existing processes, making llama.cpp more robust for developers working with diverse hardware configurations.
The latest release of llama.cpp, b9788, introduces significant improvements for dual-GPU setups with SYCL support, particularly enhancing tensor parallelism. By implementing a degenerate ring all-reduce for dual-GPU configurations, the update optimizes performance for both small and large tensor operations, mirroring CUDA's NCCL allreduce pattern. This release notably boosts performance metrics, with Llama-3.3-70B and Qwen3-Coder-Next-80B-A3B models showing substantial speed improvements. The update positions llama.cpp as a more competitive option for multi-GPU environments, without adding new dependencies or altering build configurations.
© The AI Daily BriefOpenAI has announced the development of its first custom chip, named 'Jalapeño'.