
Gemini 3.1 Pro: Google Reclaims the AI Benchmark Crown

LLM Rumors · 14 min read

TL;DR: Google's Gemini 3.1 Pro, released February 19, 2026, leads on 13 of 16 industry benchmarks, scoring 77.1% on ARC-AGI-2 (more than doubling Gemini 3 Pro's 31.1%), 94.3% on GPQA Diamond, and a record 2,887 Elo on LiveCodeBench Pro[1]. It beats Claude Opus 4.6 on abstract reasoning, scientific knowledge, and agentic workflows while costing $2.00 per million input tokens versus Opus's $5.00[2]. Three months after Anthropic and OpenAI leapfrogged Gemini 3 Pro, Google just took the crown back, and the speed of this cycle tells you everything about where the AI race is heading.

The AI benchmark throne has the shelf life of a banana. In November 2025, Gemini 3 Pro launched as the leading model. By February 5, Claude Opus 4.6 had overtaken it on enterprise-critical benchmarks. Two weeks later, Google is back on top with Gemini 3.1 Pro, a model that doesn't just reclaim lost ground. It redefines what a mid-cycle update can accomplish.

Let's be clear about what just happened. Google took a three-month-old model, applied breakthroughs from its Gemini 3 Deep Think research, and produced something that leads on 13 of 16 major benchmarks. Not against last quarter's models. Against Claude Opus 4.6, GPT-5.2, and GPT-5.3-Codex, the best that Anthropic and OpenAI have to offer right now.


Why This Matters Now

Gemini 3.1 Pro isn't just another incremental update. It's Google's clearest signal that the company has figured out how to iterate at the speed its competitors set. The 77.1% ARC-AGI-2 score is more than double Gemini 3 Pro's 31.1%, the largest single-generation reasoning jump any lab has demonstrated[1]. Sundar Pichai personally promoted the launch, calling it "a step forward in core reasoning." The model is available now in preview across Google AI Studio, Vertex AI, Gemini CLI, Android Studio, and the Gemini consumer app[3].


The Benchmark Sweep: 13 of 16 Categories

The numbers are comprehensive and, for Anthropic and OpenAI, uncomfortable. Gemini 3.1 Pro doesn't win by thin margins on a few cherry-picked evaluations. It leads convincingly across reasoning, coding, science, agentic tasks, and multilingual understanding.

By The Numbers

- ARC-AGI-2: 77.1% (abstract reasoning, more than 2x Gemini 3 Pro)
- LiveCodeBench Pro: 2,887 Elo (highest Elo ever recorded in competitive coding)
- GPQA Diamond: 94.3% (scientific knowledge, leading all models)
- Terminal-Bench 2.0: 68.5% (agentic terminal coding, new SOTA)
- MCP Atlas: 69.2% (multi-step MCP workflows, leading)
- BrowseComp: 85.9% (agentic search, up from 59.2%)

The ARC-AGI-2 result deserves special attention. This benchmark tests abstract reasoning, the ability to identify patterns and generalize from minimal examples. Gemini 3 Pro scored 31.1%. Three months later, Gemini 3.1 Pro scores 77.1%. That's not an incremental improvement. That's a fundamental capability shift, and it suggests that the Deep Think research Google conducted between generations produced genuine breakthroughs in reasoning architecture.


Head-to-Head: Gemini 3.1 Pro vs. Claude Opus 4.6 vs. GPT-5.2

Here's what the full comparison looks like. The data comes from Google DeepMind's official evaluation methodology, with all models tested under their strongest thinking configurations[4].

| Metric | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 | 77.1% | 68.8% | 52.9% |
| GPQA Diamond | 94.3% | 91.3% | 92.4% |
| Terminal-Bench 2.0 | 68.5% | 65.4% | 54.0% |
| SWE-Bench Verified | 80.6% | 80.8% | 80.0% |
| MCP Atlas | 69.2% | 59.5% | 60.6% |
| BrowseComp | 85.9% | 84.0% | 65.8% |
| APEX-Agents | 33.5% | 29.8% | 23.0% |
| Humanity's Last Exam | 44.4% | 40.0% | 34.5% |
| MMMLU | 92.6% | 91.1% | 89.6% |
| Input $/1M tokens | $2.00 | $5.00 | $1.75 |

The pattern is revealing. Gemini 3.1 Pro dominates reasoning and agentic benchmarks. Claude Opus 4.6 retains a razor-thin edge on SWE-Bench Verified (80.8% vs. 80.6%), the benchmark that matters most for enterprise coding workflows. GPT-5.2 offers the lowest input price at $1.75 per million tokens but trails both competitors on nearly every metric.

What's often overlooked is the pricing asymmetry. Gemini 3.1 Pro at $2.00 per million input tokens is 60% cheaper than Claude Opus 4.6 at $5.00, while leading on more benchmarks. For enterprises running high-volume inference workloads, that price difference compounds into millions in annual savings.
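For a sense of scale, here is a rough back-of-the-envelope sketch using the published input prices; the 50 billion input tokens per month is an illustrative workload assumption, not a figure from either vendor.

```python
# Back-of-the-envelope annual input-token cost at the published list prices.
# The monthly volume is an illustrative assumption, not a vendor-quoted figure.
GEMINI_31_PRO_INPUT = 2.00    # USD per 1M input tokens
CLAUDE_OPUS_46_INPUT = 5.00   # USD per 1M input tokens

MONTHLY_INPUT_TOKENS = 50_000_000_000  # hypothetical high-volume enterprise workload


def annual_input_cost(price_per_million: float, tokens_per_month: int) -> float:
    """Annual spend on input tokens at a flat per-million-token rate."""
    return price_per_million * (tokens_per_month / 1_000_000) * 12


gemini = annual_input_cost(GEMINI_31_PRO_INPUT, MONTHLY_INPUT_TOKENS)
opus = annual_input_cost(CLAUDE_OPUS_46_INPUT, MONTHLY_INPUT_TOKENS)
print(f"Gemini 3.1 Pro:  ${gemini:,.0f}/year")   # $1,200,000
print(f"Claude Opus 4.6: ${opus:,.0f}/year")     # $3,000,000
print(f"Difference:      ${opus - gemini:,.0f}/year")
```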

Where Gemini 3.1 Pro Doesn't Win

Intellectual honesty requires acknowledging where Google's model falls short, because the gaps are as informative as the leads.

| Benchmark | Leader | Gemini 3.1 Pro |
|---|---|---|
| SWE-Bench Verified | Opus 4.6: 80.8% | 80.6% |
| GDPval-AA Elo | Sonnet 4.6: 1,633 | 1,317 |
| SWE-Bench Pro | GPT-5.3-Codex: 56.8% | 54.2% |
| Humanity's Last Exam (tools) | Opus 4.6: 53.1% | 51.4% |
| MRCR v2 (1M context) | Opus 4.6: 76% | 26.3% |

The GDPval-AA Elo gap is the most significant. Claude Sonnet 4.6 scores 1,633 versus Gemini 3.1 Pro's 1,317 on this expert task benchmark. That's not a margin. That's a different league, and it suggests Claude's architecture still handles certain classes of complex, multi-step expert reasoning better than anything Google has produced.

The MRCR v2 result at 1 million tokens is the elephant in the room. Gemini 3.1 Pro scores just 26.3% on pointwise retrieval at its maximum context length, identical to Gemini 3 Pro. Meanwhile, Claude Opus 4.6 scored 76% on the same test in our earlier analysis. Google advertises a 1 million token context window. But if the model can't reliably retrieve information from that context, the number is marketing, not capability.
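If long-context retrieval matters for your workload, it is worth probing directly rather than trusting the advertised window. Below is a minimal needle-in-a-haystack style sketch; `call_model` is a placeholder for whichever provider SDK you use, and the four-characters-per-token padding is a rough assumption.

```python
import random
import string

# Minimal long-context retrieval probe (needle-in-a-haystack style).
# `call_model` is a placeholder for your provider's SDK call; the benchmark
# scores quoted in this article come from vendor evals, not this script.

def build_haystack(total_tokens: int, needle: str, position: float) -> str:
    """Pad filler text to roughly `total_tokens` (~4 chars/token assumed)
    and bury the needle at the given relative position (0.0 to 1.0)."""
    filler_chars = total_tokens * 4
    filler = "The quick brown fox jumps over the lazy dog. " * (filler_chars // 45)
    cut = int(len(filler) * position)
    return filler[:cut] + f"\n{needle}\n" + filler[cut:]

def run_probe(call_model, context_tokens: int, trials: int = 10) -> float:
    """Fraction of trials where the model returns the buried passphrase."""
    hits = 0
    for _ in range(trials):
        secret = "".join(random.choices(string.ascii_uppercase, k=12))
        needle = f"The vault passphrase is {secret}."
        haystack = build_haystack(context_tokens, needle, random.random())
        prompt = haystack + "\n\nWhat is the vault passphrase? Answer with the passphrase only."
        answer = call_model(prompt)  # placeholder: your provider call here
        hits += secret in answer
    return hits / trials

# Example: accuracy = run_probe(my_call_model, context_tokens=1_000_000)
```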

The Agentic Revolution: MCP Atlas and BrowseComp

The most strategically important results aren't the traditional benchmarks. They're the agentic ones, and this is where Gemini 3.1 Pro makes its strongest case for enterprise adoption.

By The Numbers

- MCP Atlas: 69.2% (multi-step MCP workflows, 16% above Opus 4.6)
- BrowseComp: 85.9% (agentic web search, 45% above Gemini 3 Pro)
- APEX-Agents: 33.5% (long-horizon tasks, 82% above Gemini 3 Pro)
- τ2-bench Telecom: 99.3% (tied with Opus 4.6 for highest score)
- τ2-bench Retail: 90.8% (within 1.1 points of Opus 4.6's 91.9%)
- Terminal-Bench 2.0: 68.5% (3.1 points above Opus 4.6 on agentic terminal coding)

MCP Atlas is the benchmark to watch. It measures multi-step workflows using Model Context Protocol, the emerging standard for how AI models interact with external tools and data sources. Gemini 3.1 Pro's 69.2% is 16% above Claude Opus 4.6's 59.5% and 14% above GPT-5.2's 60.6%[4]. As enterprises build increasingly complex agent systems that chain multiple tool calls together, this benchmark becomes the single most predictive measure of real-world agent performance.
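For context on what "multi-step MCP workflows" actually exercise, here is a schematic sketch of the control flow such benchmarks score: the model repeatedly chooses a tool, the harness executes it, and the result feeds the next decision. The `plan_next_step` callback and the tool registry are generic placeholders, not the MCP SDK or any vendor API.

```python
from typing import Callable, Any

# Schematic multi-step tool loop of the kind MCP Atlas measures.
# `plan_next_step` stands in for a model call that returns either a tool
# invocation or a final answer; the tool registry is a plain dict of callables.
# This is the control flow the protocol standardizes, not the MCP SDK itself.

Tool = Callable[..., Any]

def run_agent(task: str,
              tools: dict[str, Tool],
              plan_next_step: Callable[[str, list], dict],
              max_steps: int = 10) -> Any:
    transcript: list = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = plan_next_step(task, transcript)            # model decides what to do next
        if step["type"] == "final":
            return step["answer"]
        result = tools[step["tool"]](**step["arguments"])  # execute the chosen tool
        transcript.append({"tool": step["tool"],
                           "arguments": step["arguments"],
                           "result": result})
    raise RuntimeError("Agent hit the step budget without finishing the task")
```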

BrowseComp tells a similar story. This benchmark combines web search, Python execution, and browser navigation into realistic agentic search tasks. Gemini 3.1 Pro's 85.9% represents a 45% improvement over Gemini 3 Pro's 59.2%, and it leads Claude Opus 4.6 (84.0%) by a meaningful margin. Google's deep integration with Search infrastructure gives it a structural advantage in benchmarks that involve web interaction, and that advantage is showing up in the numbers.


The Pricing Play: Premium Performance at Mid-Range Prices

Here's the genius of Google's positioning. Gemini 3.1 Pro delivers benchmark-leading performance at what is effectively mid-tier pricing, creating an uncomfortable value proposition for both Anthropic and OpenAI.

| Model | Input $/1M | Output $/1M | Best Benchmark Count |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 13 of 16 |
| Claude Opus 4.6 | $5.00 | $25.00 | 3 of 16 |
| GPT-5.2 | $1.75 | $14.00 | 0 of 16 |
| MiniMax M2.5 | $0.15 | $1.20 | Coding specialist |
| GPT-5.3-Codex | N/A | N/A | 2 (limited eval) |

At $2.00 per million input tokens, Gemini 3.1 Pro is 60% cheaper than Claude Opus 4.6 and only 14% more expensive than GPT-5.2. But it leads GPT-5.2 on every single benchmark Google evaluated. For enterprise procurement teams running the math on annual AI spend, the value calculation is straightforward: Gemini 3.1 Pro offers more capability per dollar than any other frontier model available today.

Google also offers aggressive batch pricing at $1.00 per million input tokens and context caching at $0.20 per million tokens, which drops the effective cost even further for high-volume workloads[5]. Add Google's existing cloud infrastructure discounts for Vertex AI customers, and the total cost of ownership gap widens considerably.
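A quick way to see how those tiers combine: the sketch below blends the three published input rates into an effective per-million cost. The traffic mix is an illustrative assumption; substitute your own workload profile.

```python
# Blended effective input cost per 1M tokens under Google's published rates.
# The traffic split (batch share, cached share) is an illustrative assumption.

STANDARD = 2.00  # $/1M input tokens, interactive
BATCH = 1.00     # $/1M input tokens, batch API
CACHED = 0.20    # $/1M input tokens served from the context cache

def blended_input_cost(batch_share: float, cached_share: float) -> float:
    """Effective $/1M input tokens for a workload split across the three rates.
    Shares are fractions of total input tokens; the remainder is billed at
    the standard interactive rate."""
    standard_share = 1.0 - batch_share - cached_share
    return standard_share * STANDARD + batch_share * BATCH + cached_share * CACHED

# e.g. 40% batch, 30% cached prompt prefix, 30% standard interactive:
print(f"${blended_input_cost(0.40, 0.30):.2f} per 1M input tokens")
# -> $1.06, versus the $5.00 list price for Opus 4.6 input
```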

The Deep Think Connection: Where the Reasoning Gains Came From

The 77.1% ARC-AGI-2 score didn't materialize from nowhere. On February 12, one week before 3.1 Pro's launch, Google released Gemini 3 Deep Think, a research model that scored 84.6% on the same benchmark[6]. Google's official statement confirms that 3.1 Pro builds on "the same breakthroughs" that powered Deep Think.

Gemini 3 Series Timeline

Key milestones in development

| Date | Milestone | Significance |
|---|---|---|
| Nov 18, 2025 | Gemini 3 Pro Released | Launched as the leading model; overtaken by Anthropic and OpenAI within months |
| Feb 5, 2026 | Claude Opus 4.6 Launches | Anthropic takes the benchmark lead with 80.8% SWE-Bench, 1,496 LMSYS Elo, 40% enterprise share |
| Feb 12, 2026 | Gemini 3 Deep Think | Research model hits 84.6% ARC-AGI-2, demonstrating breakthrough reasoning capabilities |
| Feb 12, 2026 | MiniMax M2.5 Released | Open-weight model matches Opus 4.6 on coding at $0.15/M tokens, disrupting pricing |
| Feb 12, 2026 | 3.1 Pro Preview Spotted | Leaked preview appears in API listings before the official announcement |
| Feb 19, 2026 | Gemini 3.1 Pro Launches | Leads 13/16 benchmarks, 77.1% ARC-AGI-2, available across all Google platforms |

This is a new versioning approach for Google. Previous mid-cycle updates used ".5" increments (Gemini 1.5, 2.5). The ".1" designation suggests Google is moving to faster, smaller iteration cycles rather than waiting for large generational leaps. If Google can deliver this magnitude of improvement every few months, the competitive dynamics of the AI race change fundamentally.

What This Means for Enterprise AI Buyers

The practical implications for teams choosing between frontier models are significant, and the answer isn't as simple as "pick the one that wins the most benchmarks."

Enterprise Decision Framework

Choose Gemini 3.1 Pro For

Reasoning-heavy workflows, agentic systems, multi-step tool use, scientific analysis, and cost-sensitive high-volume inference.

- MCP-based agent pipelines
- Research and scientific analysis
- High-volume batch processing

Choose Claude Opus 4.6 For

Enterprise coding workflows, long-context document analysis, security auditing, and workloads requiring reliable 1M token retrieval.

- SWE-bench-class code generation
- Legal/financial document review
- Zero-day vulnerability discovery

Choose GPT-5.2 For

Lowest-cost frontier inference, existing OpenAI ecosystem integrations, and workloads where marginal benchmark differences don't matter.

- ChatGPT Enterprise deployments
- Azure OpenAI integrations
- Cost-first API consumption

Choose MiniMax M2.5 For

Budget-constrained coding agents, self-hosted deployments, and workloads where 33x cost savings outweigh marginal quality differences.

- Multi-agent swarms
- Self-hosted inference
- Startup-scale agentic products

The uncomfortable truth for every model provider: no single model wins everything. Gemini 3.1 Pro leads the broadest set of benchmarks. Claude Opus 4.6 leads the most commercially valuable one (SWE-Bench Verified) and has the best long-context retrieval. GPT-5.2 has the lowest price among Western frontier models. MiniMax M2.5 makes them all look expensive. Enterprise teams increasingly need multi-model strategies, not single-vendor commitments.
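A multi-model strategy does not have to be elaborate. The sketch below shows a minimal routing layer along the lines of the framework above; the task-type categories mirror this article's analysis, while the model identifier strings and the `clients` mapping are illustrative placeholders rather than verified API model names.

```python
from typing import Callable

# Minimal task-type router sketch. The routing table mirrors the decision
# framework above; the model names are illustrative labels, not verified
# API identifiers, and `clients` maps each label to whatever provider call
# you actually use.

ROUTING_TABLE = {
    "agentic":      "gemini-3.1-pro",   # multi-step tool use, MCP pipelines
    "reasoning":    "gemini-3.1-pro",   # abstract and scientific reasoning
    "coding":       "claude-opus-4.6",  # SWE-bench-class code generation
    "long_context": "claude-opus-4.6",  # reliable 1M-token retrieval
    "bulk_coding":  "minimax-m2.5",     # budget coding agents
    "default":      "gpt-5.2",          # cost-first general traffic
}

def route(task_type: str, prompt: str,
          clients: dict[str, Callable[[str], str]]) -> str:
    """Dispatch a prompt to the model configured for its task type."""
    model = ROUTING_TABLE.get(task_type, ROUTING_TABLE["default"])
    return clients[model](prompt)
```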

The Competitive Cycle Is Accelerating

Here's the pattern that should concern every AI lab, including Google. The benchmark leadership window is shrinking.

The Shrinking Leadership Window

1. Gemini 3 Pro (Nov 2025): led for approximately 2.5 months before being overtaken by Claude Opus 4.6.

2. Claude Opus 4.6 (Feb 5): led for 14 days before Gemini 3.1 Pro reclaimed the majority of benchmarks.

3. Gemini 3.1 Pro (Feb 19): current leader on 13/16 benchmarks. How long before the next leapfrog?

4. Next frontier model (?): Claude Opus 4.7? GPT-5.3? Gemini 4? The cycle suggests weeks, not months.

Gemini 3 Pro held the lead for roughly 2.5 months. Claude Opus 4.6 held it for 14 days. If this acceleration continues, benchmark leadership becomes meaningless as a differentiator, and the competition shifts entirely to distribution, pricing, enterprise relationships, and product integration.

Google has structural advantages in several of those dimensions. Its cloud infrastructure serves millions of enterprise customers. Android puts Gemini in billions of pockets. Google Workspace integration means Gemini can be embedded in the tools hundreds of millions of knowledge workers use daily. Anthropic has Claude Code's $1.1B ARR flywheel. OpenAI has ChatGPT's 400 million weekly users. But Google has the broadest distribution surface of any AI company on earth.

How Gemini 3.1 Pro Compares to MiniMax M2.5

For teams evaluating the full landscape, the comparison with MiniMax M2.5 is as important as the frontier-vs-frontier matchup. MiniMax's open-weight model, released February 12, scores 80.2% on SWE-Bench Verified at $0.15 per million input tokens.

| Metric | Gemini 3.1 Pro | MiniMax M2.5 | Claude Opus 4.6 |
|---|---|---|---|
| SWE-Bench Verified | 80.6% | 80.2% | 80.8% |
| Input $/1M tokens | $2.00 | $0.15 | $5.00 |
| Context window | 1M tokens | 200K tokens | 1M tokens (beta) |
| Open weight | No | Yes (modified MIT) | No |
| BrowseComp | 85.9% | 76.3% | 84.0% |
| Hallucination rate | TBD | 88% (AA benchmark) | Low |
| Multi-SWE-Bench | N/A | 51.3% (#1) | N/A |

The three models occupy distinct niches. Gemini 3.1 Pro is the broadest performer at mid-range pricing. Claude Opus 4.6 is the coding and long-context specialist at premium pricing. MiniMax M2.5 is the budget coding powerhouse with an 88% hallucination caveat on general knowledge tasks. The era of a single "best model" is definitively over.

The Safety Dimension: Increased Capabilities, Increased Risks

Google's model card for Gemini 3.1 Pro contains a detail that most coverage is glossing over. The model's cyber capabilities reached DeepMind's "alert threshold," though not the Critical Capability Level[7]. More notably, 3.1 Pro shows "stronger situational awareness than Gemini 3 Pro, achieving near 100% success on challenges no other model has consistently solved."

That's a polite way of saying the model is significantly better at understanding its own situation, context, and capabilities. For safety researchers, increased situational awareness is a double-edged metric. It enables more capable agents. It also enables more capable misuse.


The Safety Signal Nobody Is Talking About

Gemini 3.1 Pro's model card reports "near 100% success on [situational awareness] challenges no other model has consistently solved." Combined with alert-threshold cyber capabilities, this places 3.1 Pro in a new risk category that enterprise security teams need to evaluate carefully[7]. Google's responsible deployment as a preview (not GA) suggests internal awareness that further safety validation is needed before full enterprise rollout.

What Gemini 3.1 Pro's Launch Tells Us About the AI Race

1. Benchmark leadership now has a shelf life measured in weeks, not quarters. Gemini 3 Pro led for 2.5 months. Claude Opus 4.6 led for 14 days. The acceleration means benchmark results are increasingly poor proxies for sustained competitive advantage. Tip: evaluate models on your specific use case, not on aggregate benchmark rankings.

2. Agentic benchmarks are the new battleground. MCP Atlas, BrowseComp, APEX-Agents, and τ2-bench are where the next wave of enterprise value will be created. Gemini 3.1 Pro's dominance here signals Google's strategic bet on agents over chatbots. Tip: prioritize MCP Atlas and BrowseComp results when evaluating models for agent-based architectures.

3. The pricing hierarchy doesn't match the performance hierarchy. GPT-5.2 is cheapest ($1.75) but trails on every benchmark. Gemini 3.1 Pro leads most benchmarks at $2.00. Claude Opus 4.6 is most expensive ($5.00) but leads on coding. Price and quality are fully decoupled. Tip: run cost-per-task analysis rather than cost-per-token comparisons (see the sketch after this list).

4. Multi-model strategies are now mandatory, not optional. No model wins everything: Gemini 3.1 Pro for reasoning and agents, Opus 4.6 for coding and long context, M2.5 for budget workloads. Enterprise teams need model routing, not model loyalty. Tip: implement a model routing layer that dispatches tasks to the optimal model based on task type.

5. Context window size without retrieval accuracy is misleading. Gemini 3.1 Pro advertises 1M tokens but scores 26.3% on MRCR v2 pointwise at that length; Claude Opus 4.6 scores 76% on the same test. Always test retrieval accuracy at your actual context lengths. Tip: run MRCR-style needle-in-haystack tests at your deployment context length before committing.
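As a rough illustration of the cost-per-task framing in point 3, the sketch below compares spend per successfully completed task instead of per token. Every number in it (token counts per attempt, success rates) is a placeholder to be replaced with measurements from your own evals.

```python
# Cost per successfully completed task, not cost per token.
# All numbers below are placeholders; substitute measurements from your own evals.

def cost_per_successful_task(input_tokens: int, output_tokens: int,
                             input_price: float, output_price: float,
                             success_rate: float) -> float:
    """Expected dollars spent per task that actually succeeds,
    amortizing failed attempts via the measured success rate."""
    per_attempt = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_attempt / success_rate

# Hypothetical agentic task profile: 60K input tokens, 8K output tokens per attempt,
# with made-up success rates standing in for your own eval results.
gemini = cost_per_successful_task(60_000, 8_000, 2.00, 12.00, success_rate=0.69)
opus = cost_per_successful_task(60_000, 8_000, 5.00, 25.00, success_rate=0.60)
print(f"Gemini 3.1 Pro:  ${gemini:.3f} per successful task")
print(f"Claude Opus 4.6: ${opus:.3f} per successful task")
```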

The real story of Gemini 3.1 Pro isn't that Google reclaimed the benchmark crown. It's that the crown itself is losing value. When leadership flips every two weeks, the sustainable competitive advantages become distribution, pricing, product integration, and developer experience, not who sits atop the leaderboard on any given Tuesday.

Google understands this better than anyone. That's why Gemini 3.1 Pro launched simultaneously across Google AI Studio, Vertex AI, Gemini CLI, Android Studio, the Gemini consumer app, NotebookLM, and Google's new Antigravity agentic development platform[3]. The model is the ammunition. The distribution is the weapon.


The Bottom Line

Gemini 3.1 Pro is the best model on the most benchmarks as of February 19, 2026. It won't hold that position for long, and Google knows it. The real signal is the speed: a ".1" update delivering 2x+ reasoning improvements in three months. If Google can sustain this iteration cadence while leveraging its unmatched distribution surface, the AI race becomes less about who has the smartest model and more about who can ship improvements fastest to the most users. Right now, no company on earth has more surface area to ship to than Google.

Sources & References

Key sources and references used in this article

1. Google Blog: Gemini 3.1 Pro Announcement
2. Google DeepMind: Gemini 3.1 Pro Model Evaluation
3. Google Cloud Blog: Gemini 3.1 Pro on Vertex AI
4. Google DeepMind: Gemini 3.1 Pro Model Card
5. Google AI for Developers: Gemini API Pricing
6. MarkTechPost: Gemini 3 Deep Think ARC-AGI-2 Analysis
7. 9to5Google: Gemini 3.1 Pro for Complex Problem-Solving
8. OfficeChai: Gemini 3.1 Pro Benchmark Analysis
9. Anthropic: Claude Opus 4.6 Announcement
10. MiniMax: M2.5 Official Announcement
11. Scale AI: MCP Atlas Leaderboard
12. ARC Prize: ARC-AGI-2 Leaderboard

Last updated: February 19, 2026