GPT-5: The model that redefined what it means to err less
11 Jun 2026
When OpenAI launched GPT-5 in August 2025, the figure that drew the most attention was not the mathematics benchmark but the reduction in hallucinations. Compared to GPT-4o, GPT-5 in reasoning mode makes 6 times fewer factual errors. Compared to o3, it delivers the same quality with half the output tokens. For the first time, the central question about language models began to shift from "how much does it know?" to "how often does it make things up?"
What GPT-5 brought that was different
GPT-5 is not a larger model than GPT-4. It is a fundamentally different architecture: the first OpenAI model with an automatic router that selects the depth of reasoning per query. For a simple question, the model answers directly. For a problem that requires multiple steps, it thinks first. This decision happens transparently, without the user having to choose between modes.
The context window is 400 thousand tokens, with up to 128 thousand output tokens, roughly three times the 128 thousand tokens of GPT-4 Turbo. The API price is US$ 1.25 per million input tokens and US$ 10.00 per million output tokens, which places the model below Claude Opus 4.7 and Gemini 3.1 Pro in price within the same capability range.
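To make the per-million prices concrete, here is a minimal sketch of the cost of a single call at the rates quoted above. The function name and the example token counts are illustrative assumptions, not part of any official SDK.

```python
# Prices quoted in the text, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 1.25
OUTPUT_PRICE_PER_MILLION = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in US dollars of one API call at the quoted prices."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


# Hypothetical request: a 20,000-token prompt with a 2,000-token answer.
cost = request_cost(20_000, 2_000)
print(f"US$ {cost:.4f}")  # 0.025 (input) + 0.020 (output) = US$ 0.0450
```

Note that in this example the answer is one tenth the size of the prompt yet accounts for almost half the bill, because output tokens cost eight times as much as input tokens.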
Benchmarks: where GPT-5 leads
At launch, GPT-5 scored 74.9% on SWE-Bench Verified — the benchmark that evaluates real bug resolution in public GitHub repositories. GPT-5 Pro with tools reached 100% on AIME 2025 and 88.4% on GPQA Diamond (PhD-level questions in the sciences). On HealthBench, the medical benchmark created by OpenAI itself with 262 physicians from 60 countries, the model scored approximately 73% — the best recorded up to that point.
The most impactful figure is not the top benchmark score but the comparison with predecessors: the misinformation rate fell from 4.8% to 2.1% with reasoning mode active. In health and law, where factual errors have serious consequences, this reduction is more than a statistical detail. It is the difference between a model usable in production and one that requires constant verification.
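A quick way to read the drop from 4.8% to 2.1% is to convert it into a relative reduction and into expected errors over a batch of answers. The sketch below does that arithmetic; the batch size of 10,000 responses is an arbitrary assumption chosen for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate shrank, e.g. 0.5 means it was cut in half."""
    return (before - after) / before


def expected_errors(rate_percent: float, n_responses: int) -> float:
    """Expected number of erroneous responses at a given error rate."""
    return rate_percent / 100 * n_responses


BEFORE, AFTER = 4.8, 2.1   # misinformation rates quoted in the text, in percent
BATCH = 10_000             # hypothetical number of responses

print(f"relative reduction: {relative_reduction(BEFORE, AFTER):.1%}")
print(f"errors before: {expected_errors(BEFORE, BATCH):.0f}")
print(f"errors after:  {expected_errors(AFTER, BATCH):.0f}")
```

Under these assumptions the rate falls by a little more than half (56.25%), which means roughly 480 flawed answers per 10,000 become roughly 210.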
GPT-5.x: the accelerated iteration
After GPT-5, OpenAI accelerated the update cycle. GPT-5.2, launched in the second half of 2025, leads MMLU with 93.0% and was the first model to consistently achieve a perfect score on AIME 2025. GPT-5.3 Codex focused on code, leading HumanEval with 97.5% and SWE-Bench with 83.0%, the best score of any model in autonomous bug resolution.
GPT-5.4 expanded the context to 1.05 million tokens. GPT-5.5, launched in April 2026, raised the price to US$ 5.00 per million input tokens and US$ 30.00 per million output tokens but brought even more sophisticated reasoning capabilities, especially for agentic and computer-use tasks.
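How much more expensive GPT-5.5 is than GPT-5 depends on the mix of input and output tokens. The sketch below compares the two price points quoted in this article, reading the GPT-5.5 pair as input and output prices; the workloads are hypothetical.

```python
# US dollars per million tokens, as quoted in the text: (input, output).
PRICES = {
    "GPT-5": (1.25, 10.00),
    "GPT-5.5": (5.00, 30.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in US dollars of a workload at the quoted per-million prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def price_multiplier(input_tokens: int, output_tokens: int) -> float:
    """How many times more the same workload costs on GPT-5.5 than on GPT-5."""
    return (workload_cost("GPT-5.5", input_tokens, output_tokens)
            / workload_cost("GPT-5", input_tokens, output_tokens))


# Hypothetical workloads: input only, output only, and a mixed case.
print(f"{price_multiplier(1_000_000, 0):.2f}x")      # 4.00x
print(f"{price_multiplier(0, 1_000_000):.2f}x")      # 3.00x
print(f"{price_multiplier(800_000, 200_000):.2f}x")  # 3.33x
```

Any real workload falls between the two extremes: 4 times the cost when it is all input, 3 times when it is all output.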
The saturation of traditional benchmarks
The fast iteration cycle of GPT-5 exposed a structural problem: traditional benchmarks are saturated. On MMLU (a multiple-choice exam covering 57 subjects, with 4 alternatives per question), the best models score 88-93%, close to the practical ceiling for the exam and far above the 25% expected from random guessing. On HumanEval, the code benchmark, the best models already exceed 95%.
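One way to see the saturation is to look at how little room is left at the top of the scale. On an exam with 4 alternatives, random guessing already yields 25%, so a common normalization rescales accuracy so that chance maps to 0 and a perfect score to 1. The sketch below applies it to the MMLU range quoted above; this correction is an illustrative choice, not a metric used in the article.

```python
def chance_corrected(accuracy: float, n_options: int = 4) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and perfect to 1.0."""
    chance = 1 / n_options
    return (accuracy - chance) / (1 - chance)


for accuracy in (0.88, 0.93):  # MMLU range quoted in the text
    headroom = 1 - accuracy    # points left before a perfect score
    print(
        f"raw {accuracy:.0%}  "
        f"chance-corrected {chance_corrected(accuracy):.1%}  "
        f"headroom {headroom:.0%}"
    )
```

At 88% and 93% raw accuracy, the chance-corrected scores are 84.0% and 90.7%, and only 12 and 7 percentage points remain for any future model to gain, which leaves little space to separate one strong model from another.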
The industry responded with harder benchmarks: GPQA Diamond (doctoral-level research questions on which field experts err 30% of the time), SWE-Bench Pro (more complex real bugs), Terminal-Bench (autonomous execution of command-line tasks), and Humanity's Last Exam (HLE), a set of questions on which PhD-level experts in the specific field score only 5%.
xAI's Grok 4 was the first model to cross 50% on HLE, in July 2025, a milestone that generated more press coverage than any previous benchmark result. The reason is symbolic: on questions that most doctorate holders get wrong, an LLM now gets half right.
What still doesn't work
Despite the advances, GPT-5 still hallucinates. The overall average rate fell to approximately 2% on general tasks, but in specialized domains the numbers are worse: 6-10% in law, 10-20% in medicine for open-ended cases, and up to 64% in summaries of clinical cases when no mitigation is applied.
The gap between the 73% on HealthBench and the 93% on MedQA (a multiple-choice medical test) illustrates the central problem: models know medicine impressively well in structured contexts, but real clinical practice involves uncertainty, incomplete information, patients who describe symptoms ambiguously, and moments when the correct answer is "I don't know, refer to a specialist." This is much harder to solve through parameter scale alone.