AI has run out of data to feed itself, and that’s a serious problem

AI depends on both the quality and the quantity of its data to be effective. Such data is becoming increasingly scarce, and this could complicate the development of the technology.

Data science lies at the heart of the development of artificial intelligence. Language models feed on a considerable volume of information to learn and acquire new abilities. It’s not just a question of quantity, however. AI also needs high-quality data.

Data shortage by 2026?

High-quality data is not inexhaustible. Researchers working on artificial intelligence have been warning for almost a year about the scarcity of such data. A paper available in the online archive arXiv details their concerns.

Their forecasts suggest that AI companies could run out of high-quality data as early as 2026. These companies would then have to turn to lower-quality data, but even that could run short between 2030 and 2060.

Rita Matulionyte referred to this situation in a recent essay published on The Conversation website. Matulionyte is a professor of information technology law at Macquarie University in Sydney, Australia.

Synthetic data instead of natural data

Given the amount of data that AI systems need to function and improve, specialist companies find themselves in a precarious position. The remarkable progress in the capabilities of language models comes from the fact that developers keep feeding them more data.

If the supply of data stagnates, the industry will also experience a noticeable slowdown. For Matulionyte, synthetic data is the solution.

In AI research and development, high-quality data refers to natural data, that is, data generated by humans. Natural data thus stands in contrast to the synthetic content produced by generative AI.

Feeding AI with synthetic data, a viable solution?

The use of synthetic data can completely break a language model. Studies show that training on AI-generated content significantly undermines the effectiveness of the resulting model. In particular, a generative AI fed with this type of content can produce results that lack relevance and variety.

Models fed with synthetic content are less effective. However, this does not prevent some companies from experimenting with this type of data.


An alternative to synthetic content would be to set up a natural data farm. Hundreds or even thousands of people would gather in a gigantic hangar, each with a smartphone or computer. Their daily activities would then generate natural data.

The solution seems practical at first glance. However, its realization poses a number of problems, particularly for companies in the artificial intelligence sector.

In principle, a company specializing in AI can seek to collaborate with an entity that holds a large quantity of high-quality data. This is probably what motivated the partnerships between Anthropic and the two Internet giants, Google and Amazon.