Gemini 3: How Flash dethroned Pro and changed the cost-quality relationship

In December 2025, Google launched Gemini 3 Flash with a specification that seemed impossible: a fast and cheap model that surpassed, on scientific benchmarks, "Pro" models from previous generations that cost 6 to 10 times more per token. Gemini 3 Flash scored 90.4% on GPQA Diamond — the same benchmark where GPT-4 recorded 53.6% — for US$ 0.50 per million input tokens.

This break is not just numerical. It redefined what the market expects from "Flash" models (fast/cheap) and forced all competitors to reassess their pricing structure.

The architecture of the disruption

Gemini 3 Flash is the result of a combination of factors: architectural improvements accumulated since Gemini 2.0, knowledge distillation from the Pro models into smaller variants, and inference optimizations that Google did not publish in detail. The practical result is three times the speed of Gemini 2.5 Pro, with equal or better performance on most benchmarks evaluated at launch.

The context window is 1 million tokens, the same as Pro models from other vendors. The model processes text, images, audio, and video natively, and it became the default model of the Gemini app in December 2025.

Gemini 3.1 Pro: when premium is still worth it

Gemini 3.1 Pro, launched in February 2026, maintains Google's flagship position for frontier tasks. With 94.3% on GPQA Diamond — the highest score ever recorded on this benchmark — the model leads in advanced scientific reasoning. It is in second place in the global composite benchmark ranking, practically tied with OpenAI's GPT-5.2.

The 1 million token context (with an experimental 2 million version) and native capacity to process the five types of input — text, image, audio, video, and PDF — in the same API call keep Gemini 3.1 Pro as a reference for complex multimodal use cases. The price is US$ 2.00 per million input tokens and US$ 12.00 per million output for contexts below 200K.
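To make the per-token pricing concrete, here is a minimal sketch of the arithmetic for a single request. The two prices are the ones quoted above for contexts below 200K tokens; the `Pricing` class, the `request_cost` helper, and the example token counts are illustrative assumptions, not part of any Google SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US$ per million tokens."""
    input_per_million: float
    output_per_million: float


# Prices quoted in the text for Gemini 3.1 Pro (contexts below 200K tokens).
GEMINI_3_1_PRO = Pricing(input_per_million=2.00, output_per_million=12.00)


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in US$ of one request under simple per-token pricing."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Hypothetical request: a 50,000-token document and a 2,000-token answer.
cost = request_cost(GEMINI_3_1_PRO, input_tokens=50_000, output_tokens=2_000)
print(f"US$ {cost:.3f}")
```

In this hypothetical request the output is 4% of the tokens but about a fifth of the bill, because output tokens are priced six times higher than input tokens.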

Gemini 3.5 Flash: the next step on the ladder

In May 2026, at Google I/O, Gemini 3.5 Flash was launched — confirming the rapid evolution cadence of the family. The model surpasses Gemini 3.1 Pro on agentic and coding benchmarks (Terminal-Bench 2.1: 76.2% vs 70.3%; MCP Atlas: 83.6% vs 78.2%), with a token output speed 4 times higher.

At the same event, Google announced Gemini 3.5 Pro, expected for general availability in June 2026, and Gemini Omni — a model that accepts and generates video grounded in real-world knowledge. Gemini Spark, aimed at personal agency, was also presented, with the ability to execute actions on behalf of the user.

What the success of Flash means for the market

The rise of Flash models exposes a recurring pattern in the history of LLMs: what was considered a premium capability becomes a commodity in 12 to 18 months. GPT-4 was a frontier leap in 2023. In 2026, models with equivalent performance cost less than US$ 1 per million tokens.

For those who make AI infrastructure decisions, the practical implication is the need for a layered architecture. There is no longer any justification for sending every request to the most expensive model. The emerging configuration uses ultra-cheap models (Flash, DeepSeek Flash) for the 70-80% of queries that are low-complexity, mid-tier models for medium complexity, and premium models only when the task requires frontier reasoning.
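The layered configuration described above can be sketched as a small router. Everything here is an assumption made for illustration: the complexity heuristic is a toy, the thresholds (0.3 and 0.7) are arbitrary, and the tier names are labels rather than real model identifiers. A production system would more plausibly use a trained classifier or a cheap model to score requests.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"      # e.g. Flash-class models
    MID = "mid"          # mid-tier models
    PREMIUM = "premium"  # frontier models


# Hypothetical keywords hinting that a prompt needs deeper reasoning.
REASONING_KEYWORDS = ("prove", "derive", "refactor", "multi-step", "analyze")


def estimate_complexity(prompt: str) -> float:
    """Toy heuristic (assumption): score in [0, 1] from length and keywords."""
    length_score = min(len(prompt) / 2000, 1.0)
    lowered = prompt.lower()
    hits = sum(keyword in lowered for keyword in REASONING_KEYWORDS)
    keyword_score = hits / len(REASONING_KEYWORDS)
    return 0.5 * length_score + 0.5 * keyword_score


def tier_for_score(score: float) -> Tier:
    """Map a complexity score to a tier using illustrative thresholds."""
    if score < 0.3:
        return Tier.CHEAP
    if score < 0.7:
        return Tier.MID
    return Tier.PREMIUM


def route(prompt: str) -> Tier:
    """Pick the cheapest tier the heuristic considers sufficient."""
    return tier_for_score(estimate_complexity(prompt))


print(route("What time is it in Lisbon?").value)
```

The design choice worth noting is that the router defaults to the cheapest tier and escalates only on evidence of complexity, which is what keeps most traffic on the low-cost models.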

Gemini 3.5 Flash, with an estimated price of US$ 1.50 per million input tokens, represents the new floor of the mid tier: faster than the previous Pro, cheaper than any comparable option, with 1 million tokens of context. For consumer products at scale — mobile apps, chatbots with millions of users, support systems — this cost-quality level is transformative.
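A rough way to see what tiering does to the bill is to compute a blended input price for a traffic mix. The three input prices below are the figures quoted in this article (US$ 0.50 for Gemini 3 Flash, an estimated US$ 1.50 for Gemini 3.5 Flash, US$ 2.00 for Gemini 3.1 Pro). The 75/20/5 traffic split is an assumption chosen to fall inside the 70-80% range mentioned earlier, and output-token costs are ignored because the article does not give them for every model.

```python
# Input prices in US$ per million tokens, as quoted in the article.
INPUT_PRICE = {
    "cheap": 0.50,    # Gemini 3 Flash
    "mid": 1.50,      # Gemini 3.5 Flash (estimated price)
    "premium": 2.00,  # Gemini 3.1 Pro (contexts below 200K)
}

# Hypothetical share of input tokens sent to each tier (must sum to 1).
TRAFFIC_SHARE = {"cheap": 0.75, "mid": 0.20, "premium": 0.05}


def blended_input_price(prices: dict, shares: dict) -> float:
    """Weighted average input price in US$ per million tokens."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(prices[tier] * share for tier, share in shares.items())


blended = blended_input_price(INPUT_PRICE, TRAFFIC_SHARE)
savings = 1 - blended / INPUT_PRICE["premium"]
print(f"blended: US$ {blended:.2f} per million input tokens")
print(f"savings vs. all-premium: {savings:.0%}")
```

Under these assumptions the blended input price comes to roughly US$ 0.78 per million tokens, about 61% below sending everything to the premium tier. The real figure depends on the actual traffic mix and on output pricing.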

GPQA Diamond as a thermometer

GPQA Diamond deserves attention as a metric. Unlike MMLU (university-level multiple-choice questions, now almost saturated), GPQA Diamond uses doctoral-level questions in biology, chemistry, and physics, reviewed by active researchers to ensure that the correct answer is verifiable but requires deep reasoning to find.

Human experts in the specific area of the question get about 70% of the questions right. Non-experts with a doctorate in an adjacent area get 34%. Gemini 3.1 Pro scores 94.3% — above human experts in their own discipline.

This does not mean that the model "understands physics better than physicists." It means that, on structured multiple-choice questions, the model retrieves and combines information more precisely than humans under test conditions. The distance between this performance and the ability to do original scientific research remains immense.
