10:10, 27 March 2026

Russian Researchers Introduce a New Methodology for Testing AI Assistants

A new framework called DRAGOn enables testing of retrieval-augmented generation systems, offering a way to evaluate how AI assistants perform on constantly updated data.

Russian researchers have developed a new methodology called DRAGOn, designed to test AI-based retrieval-augmented generation (RAG) systems.

The approach makes it possible, for the first time, to measure how accurately AI assistants operate on continuously updated corporate data. It addresses a long-standing challenge by automating data updates and verifying the reliability of the information used in responses.

The methodology was developed by researchers from Sber, MWS AI, and leading universities, including ITMO University, MISIS University, and HSE University. The result is the first open, dynamic testing framework for Russian-language generative AI systems with retrieval. RAG systems combine large language models with corporate knowledge bases, allowing AI to draw on both when answering user queries. This enables neural networks to deliver more up-to-date information while reducing the risk of errors.
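To make the pattern concrete, here is a minimal sketch of how a RAG system grounds its answers in a knowledge base. Everything in it (the toy retriever, the prompt format, the `generate` callable) is an illustrative assumption, not code from DRAGOn or from the systems it tests:

```python
# Minimal RAG sketch: retrieve relevant passages from a knowledge base,
# then let a language model answer using them as context.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

def retrieve(query: str, knowledge_base: list[Document], top_k: int = 3) -> list[Document]:
    """Toy retriever: rank documents by word overlap with the query.
    A production system would use a vector index or a search engine instead."""
    query_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda d: len(query_words & set(d.text.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, passages: list[Document]) -> str:
    """Assemble the grounded prompt the language model will see."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in passages)
    return (
        "Answer the question using only the sources below.\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def rag_answer(query: str, knowledge_base: list[Document], generate) -> str:
    """`generate` stands in for a call to any LLM API; it is not a real library call."""
    passages = retrieve(query, knowledge_base)
    return generate(build_prompt(query, passages))
```

Real deployments replace the word-overlap retriever with a vector index or enterprise search, but the flow (retrieve, ground, generate) is the same.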

The system automatically extracts new facts from news feeds, builds an internal knowledge map, and requires the AI to cross-reference multiple sources rather than simply reproduce text fragments. A separate “judge” neural network evaluates responses, analyzing factual accuracy and completeness.
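In rough terms, the judge step can be sketched as a second model that scores each candidate answer against reference facts. The prompt wording, the score fields, and the `judge_llm` callable below are assumptions made for illustration; DRAGOn's actual rubric is defined by the framework itself:

```python
# Hedged sketch of LLM-as-judge evaluation: a second model scores a
# candidate answer for factual accuracy and completeness against
# reference facts, returning machine-readable scores.
import json

JUDGE_PROMPT = """You are an impartial judge. Given reference facts and a
candidate answer, return JSON: {{"factual_accuracy": 0-1, "completeness": 0-1}}.

Reference facts:
{facts}

Candidate answer:
{answer}
"""

def judge(answer: str, facts: list[str], judge_llm) -> dict:
    """Ask the judge model for scores and parse its JSON reply.
    Assumes the judge reliably returns valid JSON."""
    reply = judge_llm(JUDGE_PROMPT.format(facts="\n".join(facts), answer=answer))
    scores = json.loads(reply)
    return {
        "factual_accuracy": float(scores["factual_accuracy"]),
        "completeness": float(scores["completeness"]),
    }
```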

According to Valentin Malykh, co-author of the study and head of large language model development at MWS AI, the methodology is highly adaptable and can be applied across languages and use cases, from scientific publications to legal documents.

As part of the research, the team also launched the first public leaderboard for Russian-language RAG systems. The results show that combining multiple large language models with advanced retrieval techniques improves accuracy. Experts note that DRAGOn demands multi-layered answers rather than standard ones, requiring the system under test to synthesize information from multiple sources and validate it with a judge model.

Practical Applications Across Industries

The methodology can be adopted by organizations across industries. Companies can deploy their own testing environment and evaluate systems using internal data before full-scale implementation. This allows teams to assess how accurately AI performs within specific infrastructure, reduce the risk of errors, and compare different models using consistent metrics.
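As a rough illustration of what comparing models with consistent metrics could look like, a team might run every candidate configuration over the same internal test set and average per-answer scores. The harness below is a hypothetical sketch, not DRAGOn's actual interface:

```python
# Hypothetical evaluation harness: run each RAG system on the same
# internal test set, score every answer, and report the mean per system.
from statistics import mean

def evaluate_systems(systems: dict, test_set: list[dict], score_fn) -> dict:
    """systems:  name -> callable(question) -> answer
    test_set: [{"question": ..., "facts": [...]}, ...]
    score_fn: (answer, facts) -> float in [0, 1], e.g. a judge-model score
    """
    results = {}
    for name, ask in systems.items():
        scores = [score_fn(ask(item["question"]), item["facts"]) for item in test_set]
        results[name] = mean(scores)
    return results  # e.g. {"model_a": 0.82, "model_b": 0.74}
```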

The research involved specialists from Sber (SberAI team), Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), ITMO University, MISIS University, HSE University, MWS AI, the International University of Information Technologies (IITU), and Yandex School of Data Analysis.

Verification Becomes a Requirement

The new solution reflects a broader market shift. Companies are no longer focused solely on deploying large language models – they also need tools to measure how reliably those systems operate on internal data. The methodology is expected to improve user experience by reducing errors and inconsistencies in AI assistant responses.

AI assistants are already used across sectors, including banking, insurance, healthcare, and education. Tools like DRAGOn will increasingly determine whether responses are detailed, accurate, and based on current data, or instead contain factual errors and outdated information.

At the national level, the development could help generate new ideas and future projects in the IT sector. It enables faster and more reliable validation of AI-generated outputs, while setting higher standards for virtual assistants. Over time, DRAGOn could also gain traction internationally, as it is designed to support multiple languages and domains.

The key advantage of an AI assistant is that it works directly with a user’s data inside the application. It can see tasks, track deadlines, and analyze past planning behavior. It doesn’t just generate ideas; it helps structure existing complexity. On average, it reduces routine workload by 30–50%. Human oversight remains essential, as does decision-making. Neural networks do not replace the ability to think or remove responsibility. But the level of support they already provide for routine tasks is significant.
