Understanding AI · Updated June 2026

AI training data and copyright

Humans wrote the content that trained every AI model. The models learned from it for free. The companies now charge for the result. The original creators received nothing, and some are now in court. This is arguably one of the most under-reported stories in AI.

The short version: most LLMs were pretrained on copyrighted web content without explicit consent. The value flowed from creators to AI companies to business buyers — with the creators paid nothing. That's the heart of the lawsuits, and the reason a quiet content crisis is building underneath the whole industry.

What's in the training data

Training corpora include articles, books, forum posts, code repositories, images and the broad sweep of the public web. Under most legal frameworks much of this is copyrighted, yet it has been routinely used to pretrain models. Content creators rarely gave explicit consent to that downstream use. Awareness has grown sharply: since mid-2023 there has been a marked rise in websites blocking AI crawlers.
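Blocking an AI crawler usually means adding rules to a site's robots.txt file. The sketch below shows roughly how such rules are evaluated, using Python's standard-library urllib.robotparser. The robots.txt content, the example.com URL and the helper name crawl_permissions are assumptions made up for illustration. GPTBot and CCBot are the crawler names published by OpenAI and Common Crawl. Bear in mind that robots.txt is a voluntary convention: it only works if the crawler chooses to honour it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site (an assumption, not a real policy).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt, url, agents):
    """Return {agent: allowed} for each crawler name, per the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawl_permissions(
    ROBOTS_TXT,
    "https://example.com/articles/1",
    ["GPTBot", "CCBot", "Googlebot"],
)
for agent, allowed in permissions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these rules the two AI crawlers are refused everywhere on the site, while any other crawler falls through to the catch-all entry and is allowed.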

The value-chain problem

Writers, coders, scientists and artists created the content. The models were trained on it without payment. The companies then charge for the output, at prices such as $20/month or $15 per million tokens. The people whose work made it possible got nothing. That asymmetry, not the technology, is what the legal fights are really about.

The lawsuits

Case | The core question
New York Times v OpenAI | Can copyrighted journalism be used to train models that then compete with it?
Getty Images v Stability AI | Was licensed imagery used without permission to train image models?
Authors Guild class action | Were books ingested without authors' consent or compensation?

All turn on the same question: does training on copyrighted work without permission count as fair use, or as infringement at industrial scale? The answers will shape what AI companies can legally build.

The irony: AI is eating its own food supply

If businesses use AI to generate content instead of writing it, the pool of fresh human writing shrinks. And models trained on AI-generated text tend to get worse, an effect already documented as "model collapse". The content engine that powers LLMs depends on humans continuing to create original work that AI is simultaneously making less economically viable to produce. It's a dependency the industry rarely talks about.
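A toy simulation can give some intuition for model collapse. It is an analogy, not an experiment on a real LLM. The assumption is that each "model" is just a normal distribution fitted to a small sample drawn from the previous generation's model, so every generation learns only from its predecessor's output. The function name simulate_collapse and all parameter values are illustrative choices.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a normal distribution to its own samples, generation after generation.

    Returns the fitted spread (standard deviation) at each generation,
    starting with the original "human" distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stand-in for human-written data
    history = [sigma]
    for _ in range(generations):
        # The next model sees only data produced by the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"spread at generation 0: {history[0]:.3f}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3e}")
```

In this setup the fitted spread tends to shrink towards zero: each refit loses a little of the original variety, and nothing in the loop puts it back. Fresh human data would be the thing that does.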

Why this matters for your business

Want the foundations? See what an LLM is, and how data scarcity feeds into hallucination.