What is Mixture of Experts (MoE)?

Mixture of Experts is an architecture where a large model is split into many specialist sub-networks, and only a few relevant ones activate for each token of a query. DeepSeek V3 has 671 billion parameters but activates only about 37 billion (roughly 5.5%) per token, making it far cheaper to run than a dense model of similar size.

How did DeepSeek make AI so cheap?

By combining Mixture-of-Experts efficiency with software optimisation forced by US chip export controls. The result matches strong Western models on benchmarks while training roughly 5x faster and costing 10–30x less to run.

Is DeepSeek as good as GPT-4?

On coding and many reasoning benchmarks it is competitive with GPT-4-class models at a fraction of the cost. Its weaknesses are safety guardrails and data sovereignty — its hosted API stores data in China.

DeepSeek Explained — How MoE Works

In one line: traditional "dense" models run every parameter for every query. MoE models activate only the relevant specialists. Same intelligence on tap, a fraction of the compute per answer — which is why Chinese models are 10–30x cheaper to run.

Mixture of Experts, in plain English

Picture a consultancy of 257 specialists. A dense model puts all 257 in the room for every question, which is expensive and slow. An MoE model has a router that picks the ~9 specialists who actually know the topic and asks only them. DeepSeek V3 does this in each MoE layer, token by token: 671B total parameters, ~37B active per token (about 5.5%). You keep the breadth of a huge model while paying for a small one each time.
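The routing idea above can be sketched in a few lines of Python. This is a toy illustration of generic top-k gating, not DeepSeek's actual implementation: the split into 256 routed experts plus one always-on shared expert is an assumption chosen to match the 257 and ~9 figures, the softmax gate is a common textbook choice rather than a confirmed detail, and every name here (make_expert, route, moe_layer, the tiny hidden size) is hypothetical.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # assumption: 256 routed + 1 shared = 257 "specialists"
TOP_K = 8                 # assumption: 8 routed + 1 shared = ~9 consulted
DIM = 16                  # toy hidden size, far smaller than any real model


def make_expert(rng, dim):
    """A toy 'expert': one random linear map (real experts are feed-forward networks)."""
    weights = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, top_k):
    """Score every expert, keep the top_k, and normalise their gate weights."""
    scores = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    gates = softmax([scores[i] for i in chosen])
    return chosen, gates


def moe_layer(token, routed_experts, shared_expert, router_weights, top_k):
    """Run the shared expert plus only the top_k routed experts for this token."""
    chosen, gates = route(token, router_weights, top_k)
    out = shared_expert(token)  # the shared expert runs for every token
    for idx, gate in zip(chosen, gates):
        expert_out = routed_experts[idx](token)  # the other experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen


rng = random.Random(0)
routed = [make_expert(rng, DIM) for _ in range(NUM_ROUTED_EXPERTS)]
shared = make_expert(rng, DIM)
router_w = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_ROUTED_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]

output, chosen = moe_layer(token, routed, shared, router_w, TOP_K)
print(f"experts used: {len(chosen) + 1} of {NUM_ROUTED_EXPERTS + 1}")
# prints "experts used: 9 of 257"
```

The saving comes from the loop in moe_layer: only the chosen experts do any arithmetic, so compute per token scales with the handful that are picked rather than with the full roster.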

The numbers

Metric	Dense model (typical US)	MoE model (DeepSeek V3)
Total parameters	All activated	671B
Active per token	100%	~37B (about 5.5%)
Training cost	Baseline	~5x faster, ~80% lower
Run cost	Baseline	10–30x cheaper

DeepSeek V3 matches GPT-3.5-class performance on benchmarks while training roughly 5x faster at about 80% lower cost, and it competes with GPT-4o on coding at a tiny fraction of the price.
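The headline figures can be checked with a little arithmetic. The sketch below uses only the 671B, 37B and 5x numbers quoted in this article. It assumes, as a simplification, that per-token compute scales with active parameters and that training cost scales with training time. Neither holds exactly in practice, and the variable names are illustrative.

```python
TOTAL_PARAMS_B = 671   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 37   # parameters active per token, in billions (from the article)
TRAINING_SPEEDUP = 5   # "roughly 5x faster" (from the article)

# Share of the model that actually runs for each token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# How much more per-token work a dense model of the same total size would do,
# under the simplifying assumption that compute tracks active parameters.
dense_vs_moe_compute = TOTAL_PARAMS_B / ACTIVE_PARAMS_B

# If cost tracks training time, a 5x speedup leaves 1/5 of the cost.
cost_reduction = 1 - 1 / TRAINING_SPEEDUP

print(f"active share per token: {active_fraction:.1%}")            # 5.5%
print(f"dense vs MoE per-token compute: {dense_vs_moe_compute:.1f}x")  # 18.1x
print(f"cost reduction from a 5x speedup: {cost_reduction:.0%}")   # 80%
```

This is why "5x faster" and "about 80% lower cost" are two ways of stating the same thing. The 10–30x run-cost range cannot be derived from these numbers alone, since it also depends on pricing and hardware.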

The twist: sanctions caused the breakthrough

China's open-weight dominance is partly a forced response to US export controls. Cut off from Nvidia's H100 and A100 chips since 2022, Chinese labs had to innovate on software efficiency instead of throwing more hardware at the problem. That constraint produced MoE refinements that now benefit the whole industry. The intended handicap became the edge.

Why US labs mostly use dense models

With abundant compute, US labs had less pressure to optimise: they could afford to run everything. MoE isn't unique to China (Western labs use it too), but the relentless efficiency focus that scarcity forced is why the cheapest capable models today come from Chinese labs. For background, see the power map.

The catch

Cheap and clever doesn't mean risk-free. DeepSeek's hosted API stores data in China under Chinese law, and its safety guardrails trail Western frontier labs. The mitigation — because the weights are open — is self-hosting. See the risk assessment and self-host vs API.

Want the budget picture? Compare real costs in the cheapest AI API guide and the full Chinese models guide.
