← Blog

The LLM price war: how tokens became 280 times cheaper

11 jun 2026

At the beginning of 2024, using GPT-4 Turbo cost US$ 60 per million input tokens. In June 2026, models with equivalent or superior performance cost US$ 0.14 per million (DeepSeek V4 Flash) or are completely free with self-hosting (Llama 4, DeepSeek R2). The price drop in two years is 280-fold — one of the fastest technological deflations ever observed in any software market.

How the deflation happened

Three forces combined to drive down prices.

The first was architectural efficiency. The Mixture of Experts (MoE) architecture made it possible to train models with 400 billion to 1 trillion total parameters while activating only 5-15% of those parameters per inference. An MoE model with 400B total parameters and 17B active costs approximately the same for inference as a dense 17B model — but carries the knowledge of a much larger model. DeepSeek, Qwen, Mistral, and Meta adopted this architecture almost universally.

The second was Chinese competition. DeepSeek demonstrated in January 2025 that it was possible to train a frontier model for less than 6 million dollars — against estimates of 100 million or more for comparable models from OpenAI and Google. With radically lower training costs, DeepSeek prices its API at US$ 0.14/US$ 0.28 per million tokens (input/output), forcing all competitors to respond.

The third was open source. When Llama 4, DeepSeek R2, and Qwen 3.5 are made available for free with licenses that allow commercial use, the pressure on proprietary models is structural. A company that can self-host a model of comparable quality for US$ 0.0002 per 1,000 tokens has little incentive to pay US$ 0.12.

The price map in 2026

The current market is divided into four tiers:

Ultra-cheap (< US$ 0.50/M input): DeepSeek V4 Flash (US$ 0.14), Gemini 3 Flash (US$ 0.50), GPT-4.1 Nano (US$ 0.10), Mistral Small (US$ 0.10). For classification tasks, simple summarization, structured extraction, and direct answers at scale.

Mid-tier (US$ 1.00-3.00/M input): DeepSeek V4 Pro (US$ 1.74), Grok 4.3 (US$ 1.25), Gemini 3.5 Flash (US$ 1.50), GPT-5 original (US$ 1.25), Claude Sonnet 4.6 (US$ 3.00). For medium-complexity tasks, quality content generation, document analysis.

Premium (US$ 3.00-10.00/M input): Gemini 3.1 Pro (US$ 2.00), Claude Opus 4.7 (US$ 5.00), GPT-5.5 (US$ 5.00). For frontier reasoning, medical/legal/scientific use cases where quality is critical.

Ultra-premium: GPT-5.5 Pro (US$ 30.00/M input), Claude Opus 4.8 Fast Mode (US$ 10.00). For pipelines where each generated token has high economic value.

Open source (infrastructure cost only): Llama 4 Scout/Maverick, DeepSeek R2, Qwen 3.5-397B, Gemma 4-31B, Mistral Large 3.

The paradox: total spending rose

Despite the 280-fold drop in cost per token, companies' total spending on LLMs grew 320% in the same period. The explanation is the increase in consumption: agentic workflows make 10-20 LLM calls per user task, RAG architectures inflate the context with reference documents, and continuous monitoring systems keep models active 24 hours.

The logic is analogous to that of cheap electricity: when the marginal cost falls, consumption increases more than proportionally. The "rate" decreased, but the "electricity bill" rose.

Intelligent routing: the companies' response

The pattern that emerged for cost management at scale is routing by complexity. The common heuristic: 70-80% of queries go to ultra-cheap models (Flash, Nano), 15-20% to mid-tier when there is more demanding analysis or generation, and 5-10% to premium only when frontier reasoning is necessary.

Automatic routing tools such as LiteLLM, OpenRouter, and BoltAI classify the complexity of the query before routing it, reducing production costs by 60-80% without perceptible degradation in quality for the end user.

What comes next

Market projections indicate that today's mid-tier models will cost less than US$ 0.10 per million tokens by the end of 2027. Premium frontier models should stabilize between US$ 1-3. Self-hosting via open source models will be economically competitive with APIs for any company processing more than 1 billion tokens per month.

The risk is consolidation: a sustained price war benefits those with lower marginal infrastructure costs. Google (proprietary TPUs), Amazon (Trainium), and Microsoft (Azure scale) have structural advantages over independent labs. The next phase of the price war may be decided not by model architecture, but by datacenter cost.

Get the latest posts

New articles on AI, Vibe Code and Builder Code — by email or Telegram.

or
Get it on Telegram

By subscribing, you agree to receive emails/messages and to the Privacy Policy. You can unsubscribe anytime. No spam.