20:50, 25 February 2026

Neural Networks in Russia Set to Be Trained Only on “Clean” Data

Lawmakers are considering new rules that would reshape how AI developers handle content and copyright.

As part of preparations for a draft artificial intelligence law, industry associations, IT companies, and relevant government agencies are discussing a provision that would require developers to disclose the data used to train their neural network models.

Specifically, developers could be required to provide detailed information about the datasets on which their systems were trained. That would include the dataset’s name, date of creation, format, size, purpose, and origin.
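To make the six proposed fields concrete, here is a minimal sketch of what a single disclosure record might look like in code. No official schema has been published, so the class name, field names, types, and JSON layout below are all illustrative assumptions rather than anything taken from the draft.

```python
import json
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass(frozen=True)
class DatasetDisclosure:
    """Hypothetical record covering the six fields under discussion."""
    name: str          # dataset name
    created: date      # date of creation
    data_format: str   # e.g. "jsonl", "parquet" (assumed free-text value)
    size_bytes: int    # size; the unit is an assumption
    purpose: str       # what the dataset was used for
    origin: str        # where the data came from

    def to_json(self) -> str:
        """Serialize the record, writing the date in ISO 8601 form."""
        record = asdict(self)
        record["created"] = self.created.isoformat()
        return json.dumps(record, ensure_ascii=False, sort_keys=True)


def missing_fields(record: dict) -> list[str]:
    """Return the names of disclosure fields that are absent or empty."""
    required = [f.name for f in fields(DatasetDisclosure)]
    return [name for name in required if record.get(name) in (None, "")]


# Example values are invented for illustration only.
example = DatasetDisclosure(
    name="example-news-corpus",
    created=date(2025, 1, 15),
    data_format="jsonl",
    size_bytes=1024,
    purpose="language model pretraining",
    origin="licensed publisher archive",
)
```

A registry operator could run a check like `missing_fields` on each submission before accepting it, whichever of the two registry options is eventually chosen.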

Officials have not yet decided where this information would be stored – in a dedicated AI registry or in a separate registry of datasets.

Draft Legislation

The Ministry of Digital Development is working on a framework bill. The current version does not include a requirement to disclose data sources. At the same time, authorities are discussing criteria that would define a “Russian” neural network, rules for labeling AI-generated content, and liability for the use of AI technologies.

The country already operates experimental legal regimes to test digital innovations. Since 2025, the national project Ekonomika Dannykh (Data Economy) has supported AI research and development. For now, however, ethical codes remain advisory rather than mandatory.

Benefits and Risks

Supporters of the initiative argue that disclosing training data would boost trust in AI systems and make independent model evaluation easier. It could also help create a more transparent data market and standardize reporting requirements for developers.

Developers, for their part, warn of growing bureaucratic pressure. Large models rely on millions of sources, and documenting them in detail would require substantial resources. Frequent updates would further complicate the process. Companies also fear that revealing dataset composition could erode their competitive advantage.

Copyright and the Data Market

Today, many neural networks are trained on publicly available data without obtaining separate consent from rights holders. While this approach accelerates technological progress, it also creates legal gray areas. In the United States, several high-profile lawsuits have already been filed over the use of journalistic materials to train AI systems.

If disclosure of sources becomes mandatory, companies may have to sign licensing agreements with content owners. A commercial data market could expand, with the price of information tied directly to its type and value.

At the same time, experts note that some data remains freely accessible, including public domain materials and information from open sources. However, when working with copyrighted content, developers would need to pay closer attention to the legal integrity of their datasets.

The development of new technologies inevitably raises new legal and ethical questions. Clear and broadly acceptable answers, observers say, are likely to emerge in the near future.
