GPT-5.5 and Claude Opus 4.7: The new bar for autonomous models

In March 2026, GPT-5.4 and Claude Opus 4.6 defined the state of the art. In April, OpenAI and Anthropic each released a successor that was not a fine-tuning increment but a revision of architecture and training objective, changing what these models are designed to do. The focus is no longer on answering better; it is on executing with more autonomy.

GPT-5.5: rebuilt from scratch

GPT-5.5, released on April 23, 2026, is unusual in OpenAI's history. According to the company, this is the first time since GPT-4.5 that it has rebuilt the architecture, the pre-training corpus, and the training objectives simultaneously, rather than adjusting the previous version.

The design of 5.5 was driven by a specific premise: the model needs to function as an autonomous agent, not as an answer generator. This implies the capacity to call chained tools, maintain state across long tasks, and recover from errors without human intervention.

In practice, the improvements most reported by developers are:

Persistence of instructions in long tasks. Previous models tended to "forget" instructions given at the start of long conversations or complex pipelines. 5.5 treats this as a design requirement, not a secondary characteristic.

Tool orchestration. In pipelines with multiple tools (external APIs, code execution, file reading), 5.5 demonstrates better judgment about when and how to combine capabilities. Developers report noticeably fewer errors in multi-step tool sequences.

Improved computer use. 5.5 interacts with graphical interfaces more reliably, which opens up automation use cases that were previously too fragile for production.

5.5 is available via OpenAI's API with the same endpoints as 5.4. The model is proprietary, with no weights available.

Claude Opus 4.7: a measurable leap in software engineering

Claude Opus 4.7, released by Anthropic on April 16, 2026, stands out for the size of its benchmark gains. The leap from 4.6 to 4.7 is the largest between consecutive versions that Anthropic has ever published.

SWE-Bench Verified: from 80.8% to 87.6%. SWE-Bench Pro: from 53.4% to 64.3%. MCP Atlas tool-use performance: 79.1% — the best of any model on the benchmark. And the price remained identical to Opus 4.6: US$ 5 per million input tokens, US$ 25 per million output.

For those who use models in production for code and agent tasks, this leap has concrete implications. SWE-Bench measures the ability to resolve real issues from GitHub repositories — not just generating code, but understanding an existing repository, identifying the problem, and implementing the fix. 87.6% on SWE-Bench Verified is the best mark of any commercially available model.
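One way to read these scores is as a drop in the share of issues left unresolved. The short Python sketch below does that arithmetic using only the figures quoted above; the function name `unresolved_reduction` is ours, chosen for illustration.

```python
def unresolved_reduction(before_pct: float, after_pct: float) -> float:
    """Relative drop in the share of unresolved tasks, as a percentage."""
    unresolved_before = 100.0 - before_pct
    unresolved_after = 100.0 - after_pct
    return (unresolved_before - unresolved_after) / unresolved_before * 100.0


# Scores quoted in the article: (Opus 4.6, Opus 4.7)
scores = {
    "SWE-Bench Verified": (80.8, 87.6),
    "SWE-Bench Pro": (53.4, 64.3),
}

for name, (before, after) in scores.items():
    gain = after - before
    cut = unresolved_reduction(before, after)
    print(f"{name}: +{gain:.1f} points, {cut:.1f}% fewer unresolved tasks")
```

Read this way, a gain of 6.8 points on SWE-Bench Verified corresponds to roughly a third fewer unresolved issues, which is arguably the number that matters to a team triaging what the model could not fix.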

The Claude Mythos Preview: beyond Opus

Separately from Opus 4.7, Anthropic also presented the Claude Mythos Preview in April 2026, with access restricted to approximately 50 partner organizations through Project Glasswing.

Mythos is described by Anthropic as "a leap above Opus 4.6" in three areas: cybersecurity vulnerability detection, advanced reasoning, and programming. On GPQA Diamond, a graduate-level scientific reasoning benchmark, Mythos Preview scores 94.6%, currently the best mark of any public or semi-public model.

The fact that Mythos is in restricted preview while Opus 4.7 is widely released suggests a segmentation strategy: Mythos is positioned for security and advanced research use cases, with a partner verification process, while Opus 4.7 serves the general production market.

The convergence on autonomy

The pattern that emerges from the GPT-5.5 and Claude Opus 4.7 updates is not just better benchmark performance. It is a redefinition of what a language model should do: not generate answers, but execute tasks.

This has implications for the way systems are designed. A model that maintains state, orchestrates tools, and recovers from errors autonomously is not just a better component — it is a component that changes the architecture of the system around it.
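To make the architectural point concrete, here is a minimal sketch of the loop such a system is built around: state carried between steps, tools chained so each result feeds the next, and retries on failure. Everything in it is hypothetical (`AgentState`, `run_plan`, the `fetch` and `summarize` tools), and the plan is a fixed list for simplicity; in a real agent the model itself would choose the next tool. It does not use any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable

Tool = Callable[[str], str]


@dataclass
class AgentState:
    """State carried across steps so later steps can see earlier results."""
    instructions: str
    history: list[tuple[str, str]] = field(default_factory=list)


def run_plan(state: AgentState, plan: list[str], tools: dict[str, Tool],
             max_retries: int = 2) -> AgentState:
    """Run tools in order, feeding each result into the next step."""
    current = state.instructions
    for tool_name in plan:
        for attempt in range(max_retries + 1):
            try:
                current = tools[tool_name](current)
                state.history.append((tool_name, current))
                break  # step succeeded, move on to the next tool
            except RuntimeError as exc:
                # Record the failure, then retry unless attempts are exhausted.
                state.history.append((tool_name, f"error: {exc}"))
                if attempt == max_retries:
                    raise
    return state


# Hypothetical tools: `fetch` fails on its first call to exercise recovery.
calls = {"fetch": 0}


def fetch(query: str) -> str:
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise RuntimeError("timeout")
    return f"data for '{query}'"


def summarize(text: str) -> str:
    return f"summary of {text}"


state = run_plan(AgentState("open issues"), ["fetch", "summarize"],
                 {"fetch": fetch, "summarize": summarize})
for step, result in state.history:
    print(step, "->", result)
```

The history ends up with three entries: the failed `fetch`, the successful retry, and the `summarize` step. The design point is that retry logic and state tracking live in the surrounding system today; the more reliably the model handles them itself, the thinner this scaffolding can become.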

Gemini 3.1 Pro: the cost context

In the same period, Google's Gemini 3.1 Pro has established itself as the cost-effectiveness reference at the frontier: US$ 2.00 per million input tokens and US$ 12.00 per million output tokens, with a 1-million-token context window and 80.6% on SWE-Bench. It is also the only frontier model that natively accepts text, image, audio, and video input in a single model.

GPT-5.5 and Opus 4.7 are more expensive. The premium is justified in autonomy and software engineering use cases, where the performance gaps are significant. For general use, Gemini's cost advantage is hard to ignore.
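The size of that cost gap depends on the workload, so it is worth computing for your own token volumes. The sketch below uses the per-million-token prices quoted in this article for Opus 4.7 and Gemini 3.1 Pro; the monthly volumes are a hypothetical workload, and GPT-5.5 is left out because no price for it is given here.

```python
# Prices in US$ per million tokens, as quoted in this article.
PRICES_USD_PER_MILLION = {
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00},
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in US$ for the given token volumes at the listed prices."""
    price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


# Hypothetical workload: 200M input tokens and 20M output tokens per month.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 20_000_000

costs = {
    model: monthly_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    for model in PRICES_USD_PER_MILLION
}
for model, cost in costs.items():
    print(f"{model}: US$ {cost:,.2f} per month")

ratio = costs["Claude Opus 4.7"] / costs["Gemini 3.1 Pro"]
print(f"Opus 4.7 costs {ratio:.2f}x as much for this workload")
```

For this assumed mix, Opus 4.7 comes to US$ 1,500 against US$ 640 for Gemini 3.1 Pro, a bit over twice the cost. Whether that is worth paying depends on how much of the workload falls in the agentic and software engineering tasks where the benchmark gap shows up.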

What this means

The message of April 2026 is that the frontier of closed models has not stopped moving. GPT-5.5 and Claude Opus 4.7 represent real capability increases: step changes rather than increments. And the shared focus on autonomy indicates that the next battlefield is not chat, but the agent that works while you sleep.