What data are AI models trained on?

Large language models are pretrained on enormous amounts of text, much of it scraped from the web: articles, books, forum posts and code. Much of it is copyrighted, and creators rarely explicitly consented to its use for AI training. Since mid-2023, many sites have started blocking AI crawlers in response.

Did AI companies pay for training data?

Largely no. Models were mostly trained on freely scraped web content whose creators received nothing, while the resulting products are sold to businesses. This is the core of lawsuits including the New York Times v OpenAI, Getty Images v Stability AI and the Authors Guild class action.

Does AI-generated content make future models worse?

Yes — this is already documented. Models trained on AI-generated text tend to degrade, a phenomenon sometimes called model collapse. It creates a dependency: AI relies on fresh human-created content even as it makes that content less economically viable to produce.

AI Training Data & Copyright

The short version: most LLMs were pretrained on copyrighted web content without explicit consent. The value flowed from creators to AI companies to business buyers — with the creators paid nothing. That's the heart of the lawsuits, and the reason a quiet content crisis is building underneath the whole industry.

What's in the training data

Articles, books, forum posts, code repositories, images and the broad sweep of the public web. Under most legal frameworks much of this is copyrighted, yet it has been routinely used to pretrain models. Content creators rarely explicitly consented to that downstream use. Awareness has grown sharply: since mid-2023 there's been a marked rise in websites blocking AI crawlers.
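The article doesn't say how sites block AI crawlers. One common mechanism is a robots.txt rule that names the crawler's user agent. The sketch below uses Python's standard-library parser to check such a rule. The robots.txt contents and the example.com URL are illustrative assumptions; GPTBot is the user agent OpenAI publishes for its crawler. Keep in mind that robots.txt is advisory: it only works if the crawler chooses to honour it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not taken from any real site):
# block one AI crawler by name, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/articles/1"  # hypothetical URL
print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
print("Browser allowed:", is_allowed(ROBOTS_TXT, "Mozilla/5.0", page))
```

A site owner could run a check like this against their own robots.txt to confirm the rule matches the crawler they mean to exclude.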

The value-chain problem

Writers, coders, scientists and artists created the content. The models were trained on it without payment. The companies then charge for access to the output, for example $20/month or $15 per million tokens. The people whose work made it possible got nothing. That asymmetry, not the technology, is what the legal fights are really about.

The lawsuits

Case	The core question
New York Times v OpenAI	Can copyrighted journalism be used to train models that then compete with it?
Getty Images v Stability AI	Were Getty's copyrighted images used without a licence to train image models?
Authors Guild class action	Were books ingested without authors' consent or compensation?

All turn on the same question: does training on copyrighted work without permission count as fair use, or as infringement at industrial scale? The answers will shape what AI companies can legally build.

The irony: AI is eating its own food supply

If businesses use AI to generate content instead of writing it, the pool of fresh human writing shrinks. And models trained on AI-generated text get worse — already documented as "model collapse". The content engine that powers LLMs depends on humans continuing to create original work that AI is simultaneously making less economically viable to produce. It's a dependency the industry rarely talks about.
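To get a feel for why training on your own output degrades a model, here is a toy sketch. It is an analogy only, not a language model and not how any provider trains: a simple bell-curve "model" is repeatedly refitted to samples drawn from the previous generation's fit, with no fresh real data added. The function name, generation count and sample size are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations: int = 200, sample_size: int = 20,
                      seed: int = 0) -> list[float]:
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the original human data
    spreads = [sigma]
    for _ in range(generations):
        # Generate synthetic data from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...then fit the next model to that synthetic data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        spreads.append(sigma)
    return spreads


spreads = simulate_collapse()
print(f"spread at start: {spreads[0]:.3f}")
print(f"spread at end:   {spreads[-1]:.6f}")
```

In this toy setup the spread tends to shrink generation after generation, because each refit loses a little of the variety in the data it was given. That loss of diversity is the intuition behind "model collapse", and it is why a steady supply of fresh human-created content matters.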

Why this matters for your business

Reliability in niche domains. Where good training data is scarce or contested, output quality and factual reliability drop — relevant to specialised or regulated work. See the Truth Score.
Legal exposure. The outcome of these cases may affect the models you depend on; favour providers with clear data provenance.
The human-content premium. As AI content floods the web, genuinely original human expertise becomes more valuable, not less.

Want the foundations? See what an LLM is, and how data scarcity feeds into hallucination.
