The explosion of reasoning models: o3, DeepSeek R1 and the new era of thinking

In September 2024, OpenAI launched o1-preview with a simple and revolutionary premise: before answering, the model thinks. Sixteen months later, explicit reasoning has gone from being a differentiator to being a requirement. Every frontier model in 2026 includes some form of thinking. The question is no longer "does the model reason?" but "how much does this reasoning cost and when is it worth using?"

What reasoning models are

The technical difference is straightforward. Conventional models begin writing the visible answer immediately: they receive the input and produce the output directly, token by token. Reasoning models allocate "thinking" tokens before the final answer: the model writes to itself, decomposes the problem, considers alternatives, checks for inconsistencies, and only then produces the visible answer.

This process is called extended chain-of-thought or test-time compute scaling. The intuition: if training defines what the model knows, inference compute defines how much it applies that knowledge to a specific problem. More thinking tokens make it possible to solve problems that require multiple steps of logic, verification of intermediate results, and exploration of alternative paths.
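
To make the intuition concrete, here is a minimal sketch of the simplest form of test-time compute: sampling several answers and taking a majority vote. This is not the extended chain-of-thought mechanism described above, only a related technique chosen because its effect can be computed exactly. It assumes each sample is independently right with probability p and that every wrong sample counts against the majority, which is a pessimistic simplification.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n samples is correct.

    Assumptions (illustrative only): samples are independent, each is
    correct with probability p, and n is odd so a majority always exists.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so a strict majority always exists")
    needed = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 15):
        print(n, majority_vote_accuracy(0.6, n))
```

Under these assumptions a model that is right 60% of the time per sample reaches about 0.65 with three samples and about 0.68 with five: the same weights, more inference compute, a better answer.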

o3 and o4-mini: OpenAI's production line

o3 and o4-mini, launched in April 2025, consolidated the reasoning model as a mainstream product. o4-mini with tool use reached 99.5% pass@1 on AIME, the American Invitational Mathematics Examination, a competition exam used to qualify students for the USA Mathematical Olympiad and a standard benchmark for advanced mathematics. o3 scored 96.7% on the same benchmark.

The difference between the two is primarily economic: o4-mini costs half as much as o3 and delivers similar results in mathematics and code. o3-pro, the most capable version, costs 36 times as much as o4-mini, which is justifiable only where each response token has high business value.
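
A rough sketch of what those ratios mean per query. The relative prices come from the paragraph above (o4-mini as 1, o3 as 2, o3-pro as 36); they are ratios, not real price-list values. The token counts are hypothetical, and the sketch assumes thinking tokens are billed at the same rate as answer tokens.

```python
# Relative price per token, taken from the ratios stated in the text.
# These are NOT actual prices; o4-mini is the unit of account.
RELATIVE_PRICE = {"o4-mini": 1.0, "o3": 2.0, "o3-pro": 36.0}


def relative_query_cost(model: str, answer_tokens: int, thinking_tokens: int) -> float:
    """Cost of one query, measured in 'o4-mini token' units.

    Assumption: thinking tokens are billed like visible answer tokens.
    """
    return RELATIVE_PRICE[model] * (answer_tokens + thinking_tokens)


if __name__ == "__main__":
    # Hypothetical query: 500 visible tokens after 4,000 thinking tokens.
    for model in RELATIVE_PRICE:
        print(model, relative_query_cost(model, 500, 4000))
```

In this example the thinking tokens, not the visible answer, account for most of the bill, so the price multiplier applies mainly to text the user never sees.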

Both were absorbed into GPT-5, released in August 2025, which unified the pipeline: a single model that automatically selects the depth of reasoning based on the complexity of the problem.

DeepSeek R1 and R2: the open source version of thinking

DeepSeek launched R1 in January 2025 under an MIT license — the first open source reasoning model with performance comparable to OpenAI's o1. The impact was immediate: any company can host a reasoning model on its own infrastructure without paying per token.

R1 has 671 billion total parameters, of which 37 billion are active per token through its Mixture of Experts architecture. It was trained with the GRPO algorithm (Group Relative Policy Optimization), which reduces the cost of reinforcement learning training by approximately 50% relative to previous approaches. The total training cost was less than 6 million dollars, less than one tenth of the estimated cost of comparable models from Western labs.
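
The "group relative" part of GRPO can be sketched in a few lines. The idea is that several answers are sampled for the same prompt and each one is scored against the others in its group, which stands in for a separately trained value model. The sketch below shows only that normalization step, not the full policy update. The function name, the use of population standard deviation, and the epsilon term are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize the rewards of one group of sampled answers to one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumptions: population std and a small eps to avoid division by zero
    when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four answers: two judged correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers that beat their group average get a positive advantage and are reinforced; answers below it are discouraged. A group in which every answer earns the same reward produces all zeros and therefore no learning signal.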

R2, launched in April 2026, took the opposite direction: 32 billion dense parameters (all active in each inference), without an MoE architecture. The reason is pragmatic: a dense 32B model can run on a single consumer GPU with 24GB of VRAM once its weights are quantized to roughly 4 bits each (at 16-bit precision the weights alone would take about 64GB). For teams that want local reasoning without dependence on a cluster, R2 is the most accessible option available. It scores 92.7% on AIME 2025, above many proprietary models with a cost per token tens of times higher.
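
The single-GPU claim depends on the precision at which the weights are stored, and the arithmetic is easy to check. The sketch below counts only the weights. It ignores the KV cache, activations, and runtime overhead, all of which need additional memory, so the results are lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    Lower bound only: KV cache, activations and runtime overhead are excluded.
    """
    return params_billion * bits_per_param / 8


def fits_in_vram(params_billion: float, bits_per_param: float, vram_gb: float) -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gb(params_billion, bits_per_param) <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(bits, weight_memory_gb(32, bits), fits_in_vram(32, bits, 24))
```

For 32 billion parameters this gives 64 GB at 16 bits, 32 GB at 8 bits, and 16 GB at 4 bits. Only the 4-bit case fits in 24 GB, leaving roughly 8 GB for everything else.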

What reasoning solves better (and worse)

The gains of reasoning models are concentrated in specific domains: mathematics, code, formal logic, long-document analysis, and structured scientific reasoning. Problems that decompose into verifiable steps benefit directly from thinking.

For conversational tasks, text synthesis, classification, and creative generation, the gain is marginal and the extra cost is rarely justified. The heuristic that emerged in 2025-2026 is simple: use reasoning when a wrong answer is costly and the problem has a verifiable logical structure. Use direct models for everything that requires speed and where errors are cheap.
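
Written as a routing rule, the heuristic is a two-condition check. The Task fields and the mode names below are hypothetical, introduced only for illustration. In practice, deciding whether a task has "verifiable structure" is the hard part, and this sketch simply takes it as given.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a request, used only for this sketch."""
    error_cost_high: bool        # would a wrong answer be expensive?
    verifiable_structure: bool   # does the problem decompose into checkable steps?


def choose_mode(task: Task) -> str:
    """Return 'reasoning' only when both conditions of the heuristic hold."""
    if task.error_cost_high and task.verifiable_structure:
        return "reasoning"
    return "direct"


if __name__ == "__main__":
    print(choose_mode(Task(error_cost_high=True, verifiable_structure=True)))
    print(choose_mode(Task(error_cost_high=False, verifiable_structure=True)))
```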

Test-time compute: the new frontier

The most important discovery of 2025 was not a new model. It was the confirmation that allocating more compute at inference is an axis of improvement as real as increasing the number of parameters.

This has profound economic implications. The previous paradigm was: training a larger model costs billions, but inference is cheap. The new paradigm adds a dimension: you can spend more on inference in exchange for higher quality — without retraining anything. For high-value use cases, such as drug discovery, production code generation, or complex legal analysis, this economically justifies a significantly higher cost per query.
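
One way to frame that trade-off is expected total cost per query: the price of the query plus the probability of a wrong answer times what a wrong answer costs. The sketch below is an illustrative model, not a measured result, and every number in the usage example is hypothetical.

```python
def expected_total_cost(query_cost: float, error_rate: float, error_cost: float) -> float:
    """Price of the query plus the expected cost of a wrong answer."""
    return query_cost + error_rate * error_cost


def break_even_error_cost(direct_cost: float, direct_err: float,
                          reasoning_cost: float, reasoning_err: float) -> float:
    """Error cost above which the reasoning model is the cheaper choice.

    Assumes the reasoning model costs more per query and errs less often.
    """
    if reasoning_err >= direct_err:
        raise ValueError("reasoning model must have the lower error rate")
    return (reasoning_cost - direct_cost) / (direct_err - reasoning_err)


if __name__ == "__main__":
    # Hypothetical figures: direct model 0.01 per query with 20% errors,
    # reasoning model 0.50 per query with 5% errors.
    print(break_even_error_cost(0.01, 0.20, 0.50, 0.05))
    print(expected_total_cost(0.01, 0.20, 100.0))
    print(expected_total_cost(0.50, 0.05, 100.0))
```

With these hypothetical figures the break-even point is an error cost of about 3.27. If a wrong answer costs 100, the direct model's expected total is 20.01 against 5.50 for the reasoning model, even though the reasoning query itself is fifty times more expensive.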

The open question for 2026-2027 is whether the scaling of test-time compute will encounter diminishing returns, or whether there are still orders of magnitude of gain available by allocating more thinking tokens per query.

