GenAI: Should we talk about Content or Data?

Words matter. If you are a publisher, you will talk about content. If you are in GenAI, you will talk about data or supply… for the same thing!

Here, as this article was written with the help of GenAI 😉, we will talk about data, keeping in mind that this "data" is "content": text, images, and video produced by humans, and most of it high-quality content!


The rise of powerful language models has led to an insatiable demand for high-quality data to train these systems. As models become larger and more capable, they require unprecedented amounts of text data to learn from. Researchers have found that increasing the amount of training data leads to significant performance improvements for these models.

Initially, companies relied on publicly available datasets like Wikipedia, books, and web crawl data. However, they have now exhausted many of these reputable online sources. Experts estimate that by 2028 the demand for quality data could surpass the total supply of text ever produced.

Data and Content

In their desperation for more data, AI companies have resorted to questionable practices that push ethical and legal boundaries:

1. Transcribing YouTube videos: OpenAI reportedly used its Whisper tool to transcribe over 1 million hours of YouTube videos, potentially violating YouTube’s terms of service and infringing on creators’ copyrights. Google has also been accused of similar practices.

2. Accessing private user data: Companies like Google and Meta have discussed tapping into user-generated content from services like Google Docs, Google Maps reviews, Facebook posts, and Instagram data, raising privacy concerns.

3. Copying copyrighted works: Executives at Meta discussed scraping and summarizing copyrighted books, articles, and other creative works without permission or payment to creators, potentially opening the company to lawsuits.

4. Synthetic data generation: As a long-term solution, companies are exploring training AI models on synthetic data generated by other AI models, though the reliability of this approach is still being debated.

The data hunger has led to a surge in the data sales market, which is projected to grow from $2.5 billion today to over $30 billion within a decade. Companies are scrambling to acquire data sources, with some even considering buying entire publishing houses like Simon & Schuster.
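To put that projection in perspective, here is a rough back-of-the-envelope sketch of the yearly growth it implies. It assumes "within a decade" means exactly 10 years, treats $30 billion as the end value rather than a floor, and assumes a constant growth rate; the function name is purely illustrative.

```python
def implied_annual_growth(start: float, end: float, years: float) -> float:
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Figures from the projection above, in billions of dollars.
# Assumption: "within a decade" is read as exactly 10 years.
rate = implied_annual_growth(start=2.5, end=30.0, years=10)
print(f"Implied annual growth: {rate:.1%}")
```

Under these assumptions the market would have to grow by roughly 28% every year for ten years in a row, which gives a sense of how aggressive the projection is.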


Overall, the insatiable demand for data highlights the immense value of information in the AI era. As models become more capable, the companies controlling vast proprietary datasets, like user data from Google, Meta, and other tech giants, may gain a significant competitive advantage in the generative AI race.