In one line: a large language model is trained to predict the next token (word-piece) in a sequence. Repeat that across a vast share of the text available online and in print, and you get a system that can write, summarise, translate and code. It does this not by understanding, but by completing patterns with remarkable accuracy.
Next-token prediction, explained
The training goal sounds almost too simple: give the model some text, ask it to predict what comes next, and repeat billions of times. Yet at scale this produces systems that write essays, summarise documents, translate languages and explain code. In learning which word tends to follow another, the model also absorbs grammar, facts, writing styles, code patterns and reasoning-like structures.
The analogy: imagine reading every book, article, forum post and website ever written, then learning to predict — with uncanny accuracy — what word comes next in any sentence. That's an LLM. Not intelligence. Not understanding. Pattern completion at scale.
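The training goal above can be sketched at toy scale. The example below is a minimal illustration, not how a real LLM is built: it counts which word follows which in a tiny made-up corpus (a "bigram" model) and predicts the most frequent follower. Real models use neural networks over sub-word tokens and far more context; the corpus and the function names `train_bigram`, `predict_next` and `complete` are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each token, which tokens follow it in the training text."""
    tokens = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    counts = follows.get(token.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def complete(follows, start, length=5):
    """Greedily extend `start` one predicted token at a time."""
    out = [start.lower()]
    for _ in range(length):
        nxt = predict_next(follows, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Hypothetical toy corpus; whitespace-separated words stand in for tokens.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
model = train_bigram(corpus)

print(predict_next(model, "the"))   # "cat" follows "the" most often in this corpus
print(complete(model, "the", length=4))
```

Even this crude counter picks up that "cat" tends to follow "the" and "on" follows "sat". Nothing in it knows what a cat is, which is the point of the analogy: the output is pattern completion, and scaling the same idea up changes the quality of the patterns, not the nature of the task.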
Why it's brilliant at some things and useless at others
This single fact explains the strengths and weaknesses of most AI products built on these models:
| Brilliant at (recombining known patterns) | Poor at (needs grounding or novelty) |
|---|---|
| Writing and rewriting | Reliable factual recall |
| Summarising supplied text | Exact maths without a tool |
| Translating | Genuinely novel reasoning |
| Explaining and generating code | Real-world grounding |
Tasks that recombine existing knowledge play to the model's strength. Tasks that need new reasoning or verified facts hit its structural weakness — which is why what AI can't do matters as much as what it can.
Parameters and model size, simply
- Parameters are the "weights" a model learned in training — the distilled pattern of everything it read.
- Vendors rarely publish exact sizes. GPT-4 is widely reported, though not confirmed by OpenAI, to have roughly 1.8 trillion parameters; figures of several hundred billion for frontier Claude models are likewise unconfirmed estimates.
- A bigger model isn't automatically smarter: it has more capacity to store patterns, but it is also slower and more expensive to run.
- This is why Mixture-of-Experts architecture matters: massive total parameters, but only a small, router-selected subset of them runs for each token. See how DeepSeek did it.
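To make the Mixture-of-Experts point concrete, here is a minimal sketch of top-k routing, assuming a router that produces one score per expert for each token. It is an illustration of the general idea only, not any vendor's implementation; the scores, expert counts and parameter sizes are made-up numbers, and `route_top_k` and `active_fraction` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def active_fraction(num_experts, k, params_per_expert, shared_params):
    """Share of all parameters that actually run for one token."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return active / total


# Hypothetical router scores for one token across four experts.
weights = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print(sorted(weights))  # indices of the two experts that run for this token

# Hypothetical sizes: 8 experts of 10 units each plus 20 shared units, 2 experts active.
print(active_fraction(num_experts=8, k=2, params_per_expert=10, shared_params=20))
```

With these illustrative numbers only 40% of the parameters run for a given token, which is why total parameter count alone is a poor guide to the speed or cost of a Mixture-of-Experts model.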
What it means for your business
Understanding next-token prediction is the foundation of every sensible AI decision. It tells you to ground the model in your own data (retrieval-augmented generation, or RAG) for facts, to add a human check for anything consequential, and to match model size to task rather than defaulting to the biggest, priciest option. Start with the business starter guide.
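As a rough sketch of what grounding in your own data looks like, the example below retrieves the most relevant passage from a small document set and places it in the prompt, so the model completes from supplied facts instead of from memory. It assumes simple word overlap for retrieval, whereas production systems typically use embedding-based search; the documents, the prompt wording and the names `retrieve` and `build_prompt` are hypothetical, and no model is actually called.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())


def retrieve(question, documents, top_n=1):
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_n]


def build_prompt(question, documents, top_n=1):
    """Assemble a prompt that tells the model to answer only from retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, top_n))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical company documents.
docs = [
    "Our refund window is 30 days from the delivery date.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to the EU takes 3 to 5 working days.",
]

print(build_prompt("What is the refund window?", docs))
```

The resulting prompt is what you would send to whichever model you choose. Because the answer now sits in the supplied text, the task shifts from factual recall, where the model is weak, to summarising supplied text, where it is strong. A human check on consequential answers still applies.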
Going deeper
Where did the patterns come from, and who got paid? See training data and copyright. Can this approach ever become real intelligence? See can AI become intelligent and the honest AGI debate.
Now choose one. With the fundamentals clear, use the match engine to find the right model for your task.