# Gemini 3.1 Pro: Google Reclaims the AI Benchmark Crown

**Plutonous** | February 19, 2026 | 14 min read



Tags: Gemini 3.1 Pro, Google DeepMind, AI Benchmarks, Claude Opus 4.6, GPT-5.2, Agentic AI, MCP Atlas, AI Competition

---

**TL;DR:** Google's Gemini 3.1 Pro, released February 19, 2026, leads on 13 of 16 industry benchmarks, scoring 77.1% on ARC-AGI-2 (more than doubling Gemini 3 Pro's 31.1%), 94.3% on GPQA Diamond, and a record 2,887 Elo on LiveCodeBench Pro<sup><a href="#source-1">[1]</a></sup>. It beats Claude Opus 4.6 on abstract reasoning, scientific knowledge, and agentic workflows while costing $2.00 per million input tokens versus Opus's $5.00<sup><a href="#source-2">[2]</a></sup>. Three months after Anthropic and OpenAI leapfrogged Gemini 3 Pro, Google just took the crown back, and the speed of this cycle tells you everything about where the AI race is heading.

The AI benchmark throne has the shelf life of a banana. In November 2025, Gemini 3 Pro launched as the leading model. By February 5, Claude Opus 4.6 had overtaken it on enterprise-critical benchmarks. Two weeks later, Google is back on top with Gemini 3.1 Pro, a model that doesn't just reclaim lost ground. It redefines what a mid-cycle update can accomplish.

Let's be clear about what just happened. Google took a three-month-old model, applied breakthroughs from its Gemini 3 Deep Think research, and produced something that leads on 13 of 16 major benchmarks. Not against last quarter's models. Against Claude Opus 4.6, GPT-5.2, and GPT-5.3-Codex, the best that Anthropic and OpenAI have to offer right now.

> **Why This Matters Now**
>
> Gemini 3.1 Pro isn't just another incremental update. It's Google's clearest signal that the company has figured out how to iterate at the speed its competitors set. The 77.1% ARC-AGI-2 score is more than double Gemini 3 Pro's 31.1%, the largest single-generation reasoning jump any lab has demonstrated<sup><a href="#source-1">[1]</a></sup>. Sundar Pichai personally promoted the launch, calling it "a step forward in core reasoning." The model is available now in preview across Google AI Studio, Vertex AI, Gemini CLI, Android Studio, and the Gemini consumer app<sup><a href="#source-3">[3]</a></sup>.


## The Benchmark Sweep: 13 of 16 Categories

The numbers are comprehensive and, for Anthropic and OpenAI, uncomfortable. Gemini 3.1 Pro doesn't win by thin margins on a few cherry-picked evaluations. It leads convincingly across reasoning, coding, science, agentic tasks, and multilingual understanding.

- **77.1%**: ARC-AGI-2
- **2,887**: LiveCodeBench Pro
- **94.3%**: GPQA Diamond
- **68.5%**: Terminal-Bench 2.0
- **69.2%**: MCP Atlas
- **85.9%**: BrowseComp


The ARC-AGI-2 result deserves special attention. This benchmark tests abstract reasoning, the ability to identify patterns and generalize from minimal examples. Gemini 3 Pro scored 31.1%. Three months later, Gemini 3.1 Pro scores 77.1%. That's not an incremental improvement. That's a fundamental capability shift, and it suggests that the Deep Think research Google conducted between generations produced genuine breakthroughs in reasoning architecture.



## Head-to-Head: Gemini 3.1 Pro vs. Claude Opus 4.6 vs. GPT-5.2

Here's what the full comparison looks like. The data comes from Google DeepMind's official evaluation methodology, with all models tested under their strongest thinking configurations<sup><a href="#source-4">[4]</a></sup>.


The pattern is revealing. Gemini 3.1 Pro dominates reasoning and agentic benchmarks. Claude Opus 4.6 retains a razor-thin edge on SWE-Bench Verified (80.8% vs. 80.6%), the benchmark that matters most for enterprise coding workflows. GPT-5.2 offers the lowest input price at $1.75 per million tokens but trails both competitors on nearly every metric.

What's often overlooked is the pricing asymmetry. Gemini 3.1 Pro at $2.00 per million input tokens is 60% cheaper than Claude Opus 4.6 at $5.00, while leading on more benchmarks. For enterprises running high-volume inference workloads, that price difference compounds into millions in annual savings.
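The compounding claim is easy to sanity-check. A minimal sketch, using the per-million-token input prices stated above; the 50-billion-tokens-per-month workload is a hypothetical assumption for illustration:

```python
# Back-of-the-envelope input-token cost comparison.
# Prices per 1M input tokens are from the article; the 50B-tokens/month
# workload is a hypothetical high-volume assumption.

PRICES = {                 # USD per 1M input tokens
    "gemini-3.1-pro": 2.00,
    "claude-opus-4.6": 5.00,
    "gpt-5.2": 1.75,
}

def annual_input_cost(model: str, tokens_per_month: float) -> float:
    """Annual input-token spend in USD for a given monthly volume."""
    return PRICES[model] * (tokens_per_month / 1_000_000) * 12

monthly_tokens = 50_000_000_000  # hypothetical workload
gemini = annual_input_cost("gemini-3.1-pro", monthly_tokens)
opus = annual_input_cost("claude-opus-4.6", monthly_tokens)
print(f"Gemini: ${gemini:,.0f}/yr, Opus: ${opus:,.0f}/yr, "
      f"savings: ${opus - gemini:,.0f}/yr")
```

At that volume the gap is $1.8M a year on input tokens alone, before output tokens or caching discounts enter the picture.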

## Where Gemini 3.1 Pro Doesn't Win

Intellectual honesty requires acknowledging where Google's model falls short, because the gaps are as informative as the leads.


The GDPval-AA Elo gap is the most significant. Claude Sonnet 4.6 scores 1,633 versus Gemini 3.1 Pro's 1,317 on this expert task benchmark. That's not a margin. That's a different league, and it suggests Claude's architecture still handles certain classes of complex, multi-step expert reasoning better than anything Google has produced.
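To put a 316-point Elo gap in concrete terms, the standard logistic Elo formula converts it into an implied head-to-head preference rate. GDPval-AA's exact pairing methodology isn't published here, so this is the textbook interpretation, not a claim about the benchmark's internals:

```python
# Expected win rate implied by an Elo gap, using the standard
# logistic Elo formula. This is the textbook reading of the ratings,
# not GDPval-AA's own (unpublished) pairing methodology.

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Scores from the article: Claude Sonnet 4.6 (1,633) vs. Gemini 3.1 Pro (1,317).
p = elo_expected_score(1633, 1317)
print(f"Implied preference rate for Sonnet 4.6: {p:.1%}")  # ~86%
```

Under that model, evaluators would prefer Sonnet's output roughly six times out of seven, which is what "a different league" looks like numerically.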

The MRCR v2 result at 1 million tokens is the elephant in the room. Gemini 3.1 Pro scores just 26.3% on pointwise retrieval at its maximum context length, identical to Gemini 3 Pro. Meanwhile, Claude Opus 4.6 scored 76% on the same test in our earlier analysis. Google advertises a 1 million token context window. But if the model can't reliably retrieve information from that context, the number is marketing, not capability.

## The Agentic Revolution: MCP Atlas and BrowseComp

The most strategically important results aren't the traditional benchmarks. They're the agentic ones, and this is where Gemini 3.1 Pro makes its strongest case for enterprise adoption.

- **69.2%**: MCP Atlas
- **85.9%**: BrowseComp
- **33.5%**: APEX-Agents
- **99.3%**: τ2-bench Telecom
- **90.8%**: τ2-bench Retail
- **68.5%**: Terminal-Bench 2.0


MCP Atlas is the benchmark to watch. It measures multi-step workflows using the Model Context Protocol, the emerging standard for how AI models interact with external tools and data sources. Gemini 3.1 Pro's 69.2% sits 9.7 percentage points above Claude Opus 4.6's 59.5% (a 16% relative lead) and 8.6 points above GPT-5.2's 60.6%<sup><a href="#source-4">[4]</a></sup>. As enterprises build increasingly complex agent systems that chain multiple tool calls together, this benchmark becomes the single most predictive measure of real-world agent performance.
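What MCP-style benchmarks actually stress is the loop where the model proposes a tool call, the host executes it, and the result feeds the next step. A minimal, dependency-free sketch of that dispatch loop; the tool registry and the scripted plan below are hypothetical stand-ins, since a real harness would speak the Model Context Protocol to live servers and a live model:

```python
# Minimal sketch of the tool-dispatch loop that MCP-style agent
# benchmarks exercise. The tools and the scripted plan are hypothetical
# stand-ins; a real harness would use an MCP client and a live model.

TOOLS = {
    "search_docs": lambda query: f"3 results for {query!r}",
    "read_file": lambda path: f"contents of {path}",
}

def run_agent(plan):
    """Execute a multi-step plan of (tool, argument) calls, collecting
    each result into a transcript, as an MCP host would."""
    transcript = []
    for tool_name, arg in plan:
        if tool_name not in TOOLS:
            transcript.append((tool_name, "error: unknown tool"))
            continue
        transcript.append((tool_name, TOOLS[tool_name](arg)))
    return transcript

# A two-step workflow: search, then read the top hit (scripted here;
# in the benchmark, the model chooses each step from prior results).
steps = [("search_docs", "quota limits"), ("read_file", "docs/quota.md")]
for name, result in run_agent(steps):
    print(f"{name} -> {result}")
```

Every extra step in the chain multiplies the chance of a wrong tool choice or malformed argument, which is why scores on long multi-step workflows separate models far more sharply than single-call tool-use tests.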

BrowseComp tells a similar story. This benchmark combines web search, Python execution, and browser navigation into realistic agentic search tasks. Gemini 3.1 Pro's 85.9% represents a 45% relative improvement over Gemini 3 Pro's 59.2%, and it leads Claude Opus 4.6 (84.0%) by a meaningful margin. Google's deep integration with Search infrastructure gives it a structural advantage in benchmarks that involve web interaction, and that advantage is showing up in the numbers.




## The Pricing Play: Premium Performance at Mid-Range Prices

Here's the genius of Google's positioning. Gemini 3.1 Pro delivers benchmark-leading performance at what is effectively mid-tier pricing, creating an uncomfortable value proposition for both Anthropic and OpenAI.


At $2.00 per million input tokens, Gemini 3.1 Pro is 60% cheaper than Claude Opus 4.6 and only 14% more expensive than GPT-5.2. But it leads GPT-5.2 on every single benchmark Google evaluated. For enterprise procurement teams running the math on annual AI spend, the value calculation is straightforward: Gemini 3.1 Pro offers more capability per dollar than any other frontier model available today.

Google also offers aggressive batch pricing at $1.00 per million input tokens and context caching at $0.20 per million tokens, which drops the effective cost even further for high-volume workloads<sup><a href="#source-5">[5]</a></sup>. Add Google's existing cloud infrastructure discounts for Vertex AI customers, and the total cost of ownership gap widens considerably.
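How far those tiers drop the effective price depends on traffic mix. A minimal sketch using the three rates stated above; the 30/40/30 split between standard, batch, and cached traffic is a hypothetical example, not a published workload profile:

```python
# Blended effective input price under the tiered rates stated in the
# article ($2.00 standard, $1.00 batch, $0.20 cached, per 1M tokens).
# The 30/40/30 traffic mix is a hypothetical example.

RATES = {"standard": 2.00, "batch": 1.00, "cached": 0.20}

def blended_price(mix: dict) -> float:
    """Effective USD per 1M input tokens for a traffic mix
    (tier fractions must sum to 1.0)."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(RATES[tier] * share for tier, share in mix.items())

mix = {"standard": 0.3, "batch": 0.4, "cached": 0.3}
price = blended_price(mix)
print(f"Effective price: ${price:.2f}/1M tokens "
      f"({1 - price / RATES['standard']:.0%} below list)")
```

In this example the blended rate lands around $1.06 per million input tokens, roughly half of list price, before any Vertex AI infrastructure discounts.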

## The Deep Think Connection: Where the Reasoning Gains Came From

The 77.1% ARC-AGI-2 score didn't materialize from nowhere. On February 12, one week before 3.1 Pro's launch, Google released Gemini 3 Deep Think, a research model that scored 84.6% on the same benchmark<sup><a href="#source-6">[6]</a></sup>. Google's official statement confirms that 3.1 Pro builds on "the same breakthroughs" that powered Deep Think.


This is a new versioning approach for Google. Previous mid-cycle updates used ".5" increments (Gemini 1.5, 2.5). The ".1" designation suggests Google is moving to faster, smaller iteration cycles rather than waiting for large generational leaps. If Google can deliver this magnitude of improvement every few months, the competitive dynamics of the AI race change fundamentally.

## What This Means for Enterprise AI Buyers

The practical implications for teams choosing between frontier models are significant, and the answer isn't as simple as "pick the one that wins the most benchmarks."


The uncomfortable truth for every model provider: no single model wins everything. Gemini 3.1 Pro leads the broadest set of benchmarks. Claude Opus 4.6 leads the most commercially valuable one (SWE-Bench Verified) and has the best long-context retrieval. GPT-5.2 has the lowest price among Western frontier models. MiniMax M2.5 makes them all look expensive. Enterprise teams increasingly need multi-model strategies, not single-vendor commitments.

## The Competitive Cycle Is Accelerating

Here's the pattern that should concern every AI lab, including Google. The benchmark leadership window is shrinking.


Gemini 3 Pro held the lead for roughly 2.5 months. Claude Opus 4.6 held it for 14 days. If this acceleration continues, benchmark leadership becomes meaningless as a differentiator, and the competition shifts entirely to distribution, pricing, enterprise relationships, and product integration.

Google has structural advantages in several of those dimensions. Its cloud infrastructure serves millions of enterprise customers. Android puts Gemini in billions of pockets. Google Workspace integration means Gemini can be embedded in the tools hundreds of millions of knowledge workers use daily. Anthropic has Claude Code's $1.1B ARR flywheel. OpenAI has ChatGPT's 400 million weekly users. But Google has the broadest distribution surface of any AI company on earth.

## How Gemini 3.1 Pro Compares to MiniMax M2.5

For teams evaluating the full landscape, the comparison with MiniMax M2.5 is as important as the frontier-vs-frontier matchup. MiniMax's open-weight model, released February 12, scores 80.2% on SWE-Bench Verified at $0.15 per million input tokens.


The three models occupy distinct niches. Gemini 3.1 Pro is the broadest performer at mid-range pricing. Claude Opus 4.6 is the coding and long-context specialist at premium pricing. MiniMax M2.5 is the budget coding powerhouse with an 88% hallucination caveat on general knowledge tasks. The era of a single "best model" is definitively over.

## The Safety Dimension: Increased Capabilities, Increased Risks

Google's model card for Gemini 3.1 Pro contains a detail that most coverage is glossing over. The model's cyber capabilities reached DeepMind's "alert threshold," though not the Critical Capability Level<sup><a href="#source-7">[7]</a></sup>. More notably, 3.1 Pro shows "stronger situational awareness than Gemini 3 Pro, achieving near 100% success on challenges no other model has consistently solved."

That's a polite way of saying the model is significantly better at understanding its own situation, context, and capabilities. For safety researchers, increased situational awareness is a double-edged metric. It enables more capable agents. It also enables more capable misuse.

> **The Safety Signal Nobody Is Talking About**
>
> Gemini 3.1 Pro's model card reports "near 100% success on [situational awareness] challenges no other model has consistently solved." Combined with alert-threshold cyber capabilities, this places 3.1 Pro in a new risk category that enterprise security teams need to evaluate carefully<sup><a href="#source-7">[7]</a></sup>. Google's responsible deployment as a preview (not GA) suggests internal awareness that further safety validation is needed before full enterprise rollout.


The real story of Gemini 3.1 Pro isn't that Google reclaimed the benchmark crown. It's that the crown itself is losing value. When leadership flips every two weeks, the sustainable competitive advantages become distribution, pricing, product integration, and developer experience, not who sits atop the leaderboard on any given Tuesday.

Google understands this better than anyone. That's why Gemini 3.1 Pro launched simultaneously across Google AI Studio, Vertex AI, Gemini CLI, Android Studio, the Gemini consumer app, NotebookLM, and Google's new Antigravity agentic development platform<sup><a href="#source-3">[3]</a></sup>. The model is the ammunition. The distribution is the weapon.

> **The Bottom Line**
>
> Gemini 3.1 Pro is the best model on the most benchmarks as of February 19, 2026. It won't hold that position for long, and Google knows it. The real signal is the speed: a ".1" update delivering 2x+ reasoning improvements in three months. If Google can sustain this iteration cadence while leveraging its unmatched distribution surface, the AI race becomes less about who has the smartest model and more about who can ship improvements fastest to the most users. Right now, no company on earth has more surface area to ship to than Google.


*Last updated: February 19, 2026*

---

*Source: [LLM Rumors](https://www.llmrumors.com/news/gemini-3-1-pro-google-reclaims-benchmark-crown)*
