Google Ships Gemini 3.1 Pro, Doubling Down on Reasoning With a Model That Actually Shows Its Work
By Rina Shimizu
Google's Gemini 3.1 Pro rolls out across the Gemini app, NotebookLM, and developer platforms with a verified ARC-AGI-2 score of 77.1%, more than double its predecessor's. The release signals Google's aggressive push into reasoning-first AI, putting direct pressure on OpenAI's GPT-5 and Anthropic's Claude.
Google isn't waiting around. Just one week after updating Gemini 3 Deep Think — its specialized research model — the company is shipping the upgraded core intelligence behind it to everyone. Gemini 3.1 Pro started rolling out today across the Gemini app, NotebookLM, the Gemini API, Vertex AI, and a handful of developer tools. It's available immediately, though in preview for now.
The headline number: a verified 77.1% on ARC-AGI-2, the benchmark designed to test whether a model can solve logic patterns it's never seen before. That's more than double what Gemini 3 Pro scored. In a field where incremental gains are the norm, doubling a flagship reasoning score in a single release cycle is a statement.
## What's Actually New
Gemini 3.1 Pro isn't a new architecture or a radical departure. It's a refinement of the Gemini 3 series with significantly improved reasoning at its core. Google describes it as "a smarter, more capable baseline for complex problem-solving," which is corporate-speak, sure, but the benchmarks back it up.
The model is designed for tasks where you can't get away with a surface-level answer. Google highlighted several demos: generating website-ready animated SVGs from text prompts, building a live dashboard that pulls telemetry data from the International Space Station, coding a 3D starling murmuration with hand-tracking and generative audio, and translating the atmospheric tone of "Wuthering Heights" into a functional portfolio website.
Cherry-picked or not, these aren't mere parlor tricks. They point at something more interesting: a model that can hold complex, multi-step reasoning chains together long enough to produce working, integrated systems. The ISS dashboard example is particularly telling. It didn't just generate code; it correctly configured a public telemetry API and wired it into a real-time visualization. That's the kind of task that trips up most models because it requires understanding APIs, data formats, and frontend rendering all at once.
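Google didn't say which telemetry API the demo used, so purely as an illustration, here's a minimal sketch of the data-fetching half of such a dashboard. It polls Open Notify's free ISS position endpoint, which is our stand-in assumption, not a confirmed detail of the demo:

```python
# Minimal sketch of the polling half of an ISS dashboard.
# Open Notify's iss-now endpoint is an assumption; the announcement
# doesn't name the API Google's demo actually wired up.
# Requires: pip install requests
import time
import requests

ISS_NOW = "http://api.open-notify.org/iss-now.json"  # public, no API key

def poll_iss_position(interval_s: float = 5.0):
    """Yield (timestamp, latitude, longitude) readings indefinitely."""
    while True:
        data = requests.get(ISS_NOW, timeout=10).json()
        pos = data["iss_position"]  # values arrive as strings
        yield data["timestamp"], float(pos["latitude"]), float(pos["longitude"])
        time.sleep(interval_s)

for ts, lat, lon in poll_iss_position():
    print(f"{ts}: lat={lat:.2f} lon={lon:.2f}")  # a chart or map would go here
```

The loop itself is trivial; the impressive part of the demo is producing this plus the frontend rendering, data plumbing, and API configuration in one coherent pass.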
For consumers, 3.1 Pro is rolling out in the Gemini app with higher usage limits for Google AI Pro and Ultra subscribers. It's also hitting NotebookLM, but only for paid users. Free-tier users will have to wait.
Developers get access through Google AI Studio, the Gemini CLI, Android Studio, and Google's newer agentic development platform, Antigravity. Enterprise customers can tap it through Vertex AI and Gemini Enterprise.
## The Benchmark Question
The ARC-AGI-2 score deserves scrutiny. ARC-AGI has become something of a litmus test for genuine reasoning ability because it specifically targets novel problem-solving — the kind of thinking that can't be faked with pattern matching over training data. A 77.1% score is strong. It's not perfect, but it puts Gemini 3.1 Pro in rare company.
For context, when the original ARC challenge was introduced, top AI models struggled to break 30%. The second version is harder. Scoring above 75% on it suggests that Google's approach to reasoning — whatever architectural tweaks and training methods they've used — is producing real gains, not just benchmark gaming.
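To make that concrete, here's a toy task in the spirit of ARC (not an actual ARC-AGI-2 item): a handful of input/output grid pairs, and the solver has to infer the hidden rule from those pairs alone, then apply it to an unseen input.

```python
# Toy example in the spirit of ARC; not a real benchmark item.
# Each task gives a few input -> output grid pairs. The solver must
# discover the transformation rule and apply it to a new grid.
train_pairs = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 2], [3, 0]], [[2, 0], [0, 3]]),
]
test_input = [[0, 0], [5, 0]]

def mirror_lr(grid):
    """The hidden rule in this task: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# A solver gets no hint that mirroring is the rule; it must infer it.
assert all(mirror_lr(i) == o for i, o in train_pairs)
print(mirror_lr(test_input))  # [[0, 0], [0, 5]], the expected answer
```

Every task hides a different rule, which is exactly why memorized patterns from training data don't transfer.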
Google didn't share results on other standard benchmarks like MMLU, HumanEval, or MATH in this announcement, which is a notable omission. When a company leads with one benchmark and stays quiet on others, it's worth asking why. It's possible 3.1 Pro excels specifically at novel reasoning tasks while showing more modest improvements elsewhere. We'll know more once independent evaluators get their hands on it.
## How It Stacks Up Against the Competition
The reasoning model race has gotten intense in the last six months. OpenAI's GPT-5, released late last year, set a high bar for multi-step reasoning and has been the default recommendation for complex tasks in most head-to-head comparisons. Anthropic's Claude has carved out a reputation for careful, nuanced analysis — particularly in coding and long-context work — that's made it a favorite among developers.
Gemini 3.1 Pro's ARC-AGI-2 score suggests it could challenge both on pure reasoning tasks. But benchmarks don't tell the whole story. What matters is how the model performs on your specific workflow, and that's where things get subjective fast.
Google's real advantage here might not be the model itself but the distribution. Gemini is baked into Search, Workspace, Android, and now NotebookLM. When you improve the base model, every product gets smarter overnight. OpenAI has ChatGPT and its API. Anthropic has Claude. Google has an entire ecosystem. That distribution moat is hard to replicate, and it means Gemini 3.1 Pro will touch more users by default than either competitor, regardless of which model is technically "best."
The Antigravity platform is also worth watching. Google's been relatively quiet about it, but positioning it alongside this release suggests they're serious about agentic AI — models that don't just answer questions but take actions. If 3.1 Pro's reasoning improvements translate into more reliable autonomous workflows, that's a bigger deal than any benchmark score.
## What It Means for Users
If you're a paying Gemini subscriber, you're getting a meaningful upgrade at no extra cost. The model should handle complex questions, multi-step analysis, and creative coding tasks noticeably better than its predecessor. NotebookLM users who've been frustrated with shallow analysis of uploaded documents should see improvements.
For developers, the preview status matters. Google explicitly said they're releasing it in preview to "validate these updates" before general availability. That means you shouldn't bet your production workload on it just yet, but it's absolutely worth testing. The Gemini API pricing and rate limits for 3.1 Pro weren't detailed in the announcement — a detail that'll matter a lot for anyone building at scale.
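If you want to kick the tires, the call itself is short with the google-genai Python SDK. The model identifier below is a guess on our part; the announcement didn't publish the preview's exact id, so check the model list in Google AI Studio before relying on it:

```python
# Minimal sketch using the google-genai SDK (pip install google-genai).
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # hypothetical id; verify the real one
    contents="Plan, step by step, a dashboard that charts live ISS telemetry.",
)
print(response.text)
```

Swapping model ids in a call like this is exactly how you'd A/B the preview against your current model before committing anything to production.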
The broader takeaway is that the gap between the top models is shrinking. Six months ago, you could make a strong case that one model was clearly ahead. Today, Google, OpenAI, and Anthropic are all shipping models that can handle genuinely hard reasoning tasks. The differentiation is moving away from raw capability and toward ecosystem, pricing, reliability, and specialization.
Google's betting that reasoning is the feature that matters most right now, and they're pushing it hard. Whether Gemini 3.1 Pro actually unseats the competition in day-to-day use is something benchmarks can't answer. But with a 77.1% ARC-AGI-2 score and distribution across Google's entire product surface, it's not a model anyone can afford to ignore.
General availability is expected "soon," per Google. We'll be testing it extensively when it arrives.