The short version: most LLMs were pretrained on copyrighted web content without explicit consent. The value flowed from creators to AI companies to business buyers — with the creators paid nothing. That's the heart of the lawsuits, and the reason a quiet content crisis is building underneath the whole industry.
What's in the training data
Articles, books, forum posts, code repositories, images and the broad sweep of the public web. Under most legal frameworks much of this is copyrighted, yet it has been routinely used to pretrain models. Content creators rarely explicitly consented to that downstream use. Awareness has grown sharply: since mid-2023 there's been a marked rise in websites blocking AI crawlers.
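In practice, that blocking usually happens through a site's robots.txt file. The sketch below shows roughly how such a rule is read, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL and the `crawler_allowed` helper are illustrative assumptions, not taken from any real site. `GPTBot` and `CCBot` are the published user-agent names for OpenAI's and Common Crawl's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site (an assumption, not a real file).
# It blocks two AI-related crawlers and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-post"
    for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
        print(agent, "allowed:", crawler_allowed(ROBOTS_TXT, agent, url))
```

Note that robots.txt is a voluntary convention: it records the publisher's wishes, but it only works if the crawler chooses to honour it.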
The value-chain problem
Writers, coders, scientists and artists created the content. The models trained on it without payment. The companies then charge $20/month or $15 per million tokens for the output. The people whose work made it possible got nothing. That asymmetry — not the technology — is what the legal fights are really about.
The lawsuits
| Case | The core question |
|---|---|
| New York Times v OpenAI | Can copyrighted journalism be used to train models that then compete with it? |
| Getty Images v Stability AI | Were copyrighted stock images used without a licence to train image models? |
| Authors Guild class action | Were books ingested without authors' consent or compensation? |
All turn on the same question: does training on copyrighted work without permission count as fair use, or as infringement at industrial scale? The answers will shape what AI companies can legally build.
The irony: AI is eating its own food supply
If businesses use AI to generate content instead of writing it, the pool of fresh human writing shrinks. And models trained repeatedly on AI-generated text tend to get worse, an effect researchers have documented as "model collapse". The content engine that powers LLMs depends on humans continuing to create original work, which AI is simultaneously making less economically viable to produce. It's a dependency the industry rarely talks about.
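The feedback loop can be illustrated with a deliberately tiny toy. It is not how LLMs are trained, and every name and parameter here is an assumption made for illustration. The "model" is just a bell curve fitted to a handful of samples, and each new generation is fitted only to samples drawn from the previous generation's fit. With no fresh human data entering the loop, the spread of the data tends to shrink generation after generation.

```python
import random
import statistics


def simulate_collapse(generations: int = 200, sample_size: int = 10, seed: int = 0) -> list[float]:
    """Toy loop: fit a Gaussian to samples, then sample from the fit, repeatedly.

    Returns the fitted standard deviation at each generation,
    starting with the original "human" distribution at index 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stand-in for original human-written data
    history = [std]
    for _ in range(generations):
        # "Generate content" from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then "train" the next model on that generated content only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: spread = {history[generation]:.6f}")
```

The small sample size is chosen to make the effect visible quickly. The shrinking spread stands in for the loss of variety and rare detail that the research describes, under the stated assumption that no new human data is ever added.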
Why this matters for your business
- Reliability in niche domains. Where good training data is scarce or contested, output quality and factual reliability drop — relevant to specialised or regulated work. See the Truth Score.
- Legal exposure. The outcome of these cases may affect the models you depend on; favour providers with clear data provenance.
- The human-content premium. As AI content floods the web, genuinely original human expertise becomes more valuable, not less.
Want the foundations? See what an LLM is, and how data scarcity feeds into hallucination.