Russian Researchers Find AI Models Struggle With Long-Range Reasoning
The AIRI Institute’s artificial intelligence lab has developed a new benchmark showing that even the most advanced language models begin to fail and resort to near-random answers when forced to process large volumes of information. The research was presented at the ICLR 2026 conference in Brazil.

Researchers created MMReD, a benchmark designed to test how effectively AI systems reason over long contexts. Unlike conventional evaluations that require models to retrieve a single fact from a large body of text, MMReD forces systems to analyze entire chains of events, compare them, and draw conclusions. The researchers say this type of reasoning is essential for deploying AI in fields such as medicine, law, and finance.
The benchmark simulates an environment in which five characters move between six rooms. Models receive observation sequences ranging in length from 1 to 128 steps and must answer questions of varying complexity. Researchers tested 12 systems, including OpenAI’s GPT-4o, Alibaba Cloud’s Qwen2.5-VL-72B, and DeepSeek’s DeepSeek-R1. All of them showed sharp declines in performance as the amount of input data increased.
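The setup described above can be illustrated with a minimal sketch. This is not the actual MMReD code or its question format, only an assumed toy reconstruction of the kind of environment the article describes: five characters, six rooms, and a trace of movement observations that a model would have to track end to end rather than retrieve a single fact from.

```python
import random

def generate_trace(num_chars=5, num_rooms=6, steps=32, seed=0):
    """Sketch of an MMReD-style trace (illustrative, not the real benchmark):
    at each step one character moves to a different room, and answering even
    a simple "where is X now?" question requires tracking the whole sequence."""
    rng = random.Random(seed)
    chars = [f"char_{i}" for i in range(num_chars)]
    # Random starting positions for every character.
    pos = {c: rng.randrange(num_rooms) for c in chars}
    observations = []
    for step in range(steps):
        mover = rng.choice(chars)
        # The mover always changes rooms, so every observation is informative.
        new_room = rng.choice([r for r in range(num_rooms) if r != pos[mover]])
        pos[mover] = new_room
        observations.append(f"Step {step}: {mover} moves to room {new_room}.")
    return observations, pos

obs, final_pos = generate_trace(steps=8)
# Each observation is one line of context; the ground-truth answer to a
# final-position question is final_pos, which the model never sees directly.
```

Scaling `steps` from 1 toward 128 reproduces the core difficulty the researchers probed: the answer depends on the entire trace, so a model that attends to only a fraction of the input will drift toward chance-level accuracy.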
According to Kurkin, one of the researchers, the issue is not tied to any single AI architecture, as all tested models displayed a similar degradation curve. In practice, the systems were able to use only 10% to 20% of the input information effectively. The researchers argue that solving the problem will require deeper architectural changes, including systems built around recurrent memory mechanisms.
