
Together AI has introduced ReasonIF, a benchmark designed to assess how well large reasoning models (LRMs) follow user instructions throughout their reasoning processes. The study found that models like GPT-OSS-120B and Qwen3-235B fail to adhere to instructions more than 75% of the time, with performance degrading as task difficulty increases. ReasonIF consists of 300 math and science problems paired with specific reasoning instructions, aiming to improve controllability and transparency in model outputs. This research highlights the need for better instruction adherence in LRMs to enhance their usability and reliability.
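A benchmark like this boils down to pairing each problem with a programmatically checkable reasoning instruction and measuring the fraction of traces that comply. The sketch below illustrates that idea only; the instruction keys (`uppercase`, `max_words:`), function names, and scoring are assumptions for illustration, not ReasonIF's actual harness.

```python
# Minimal sketch of an adherence check in the spirit of ReasonIF.
# Instruction keys and scoring here are hypothetical, not the benchmark's own.

def adheres(trace: str, instruction: str) -> bool:
    """Check one reasoning trace against a simple, verifiable instruction."""
    if instruction == "uppercase":            # reason entirely in uppercase
        letters = [c for c in trace if c.isalpha()]
        return bool(letters) and all(c.isupper() for c in letters)
    if instruction.startswith("max_words:"):  # keep the reasoning short
        limit = int(instruction.split(":", 1)[1])
        return len(trace.split()) <= limit
    raise ValueError(f"unknown instruction: {instruction}")

def adherence_rate(samples):
    """Fraction of (trace, instruction) pairs that follow their instruction."""
    hits = sum(adheres(trace, inst) for trace, inst in samples)
    return hits / len(samples)

samples = [
    ("FIRST SOLVE FOR X.", "uppercase"),
    ("First solve for x.", "uppercase"),   # mixed case: violates instruction
    ("short answer", "max_words:5"),
]
print(adherence_rate(samples))  # 2 of 3 traces adhere
```

Checks of this form scale to a full benchmark because each instruction is machine-verifiable, so no human grading is needed to report an adherence rate.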
Together AI and Adaption have formed a partnership to integrate Together Fine-Tuning into Adaptive Data, enabling teams to optimize datasets and deploy stronger open models.
Together AI has shut down the vulnerable Copy Fail crypto socket interface across its infrastructure to mitigate risks from a logic bug in the Linux kernel.