ARC-AGI Frontier Benchmark Tracker 2026

Frontier reasoning benchmark progress in 2026: ARC-AGI-2 cracked by GPT-5.5 at 85%, ARC-AGI-3 launched March 2026 as the new ceiling with Gemini 3.1 Pro at 0.37%.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Frontier models cracked ARC-AGI-2 in Q1 2026. GPT-5.5 hit 85 percent (the grand-prize threshold) in March, and Confluence Lab reached 97.9 percent by April. ARC-AGI-3, launched in March 2026 as the new ceiling, opened with Gemini 3.1 Pro at 0.37 percent. This page consolidates the disclosed scores, the methodology behind the benchmark, and the broader reasoning-benchmark landscape (FrontierMath, Humanity's Last Exam, GPQA-Diamond, SWE-Bench Verified).

Key Findings

  1. GPT-5.5 hit 85 percent on ARC-AGI-2 in March 2026, crossing the grand-prize threshold. GPT-5.4 Pro scored 83.3 percent in January; Gemini 3.1 Pro Deep Think scored 77.1 percent in February.
  2. Confluence Lab pushed ARC-AGI-2 scores to 97.9 percent by April 2026, a research result that effectively saturated the benchmark.
  3. ARC-AGI-3 launched in March 2026 as the new frontier benchmark. Gemini 3.1 Pro scored 0.37 percent in the initial evaluation; no frontier model has crossed 5 percent as of May 2026.
  4. FrontierMath, the closed-set advanced mathematics benchmark, saw the leading model (GPT-5.5 with mathematical-reasoning tools) hit 53 percent in March 2026, up from 25 percent in late 2025.
  5. Humanity's Last Exam, the broad-knowledge frontier benchmark, has the leading frontier model at approximately 38 percent in May 2026.

ARC-AGI-2 Leaderboard (May 2026)

Model or System               | Score | Compute Tier  | Date
Confluence Lab research stack | 97.9% | Unconstrained | Apr 2026
GPT-5.5                       | 85.0% | High          | Mar 2026
GPT-5.4 Pro                   | 83.3% | High          | Jan 2026
Claude 4.7 Opus               | 81.4% | High          | Mar 2026
Gemini 3.1 Pro Deep Think     | 77.1% | High          | Feb 2026
Grok 4                        | 72.8% | High          | Mar 2026
DeepSeek V4 Reasoning         | 68.5% | High          | Apr 2026
Qwen 3.5 Max Thinking         | 61.2% | High          | Apr 2026
Human average                 | 66%   | n/a           | Baseline
Human top decile              | 92%   | n/a           | Baseline
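
For readers who want to compare the rows against the reference points used on this page (the 85 percent grand-prize threshold and the two human baselines), the following is a minimal sketch that hard-codes the scores above and sorts each entry into a band. The `classify` helper and the constant names are illustrative assumptions, not part of any ARC Prize tooling.

```python
# Scores copied from the ARC-AGI-2 leaderboard above (percent).
ARC_AGI_2_SCORES = {
    "Confluence Lab research stack": 97.9,
    "GPT-5.5": 85.0,
    "GPT-5.4 Pro": 83.3,
    "Claude 4.7 Opus": 81.4,
    "Gemini 3.1 Pro Deep Think": 77.1,
    "Grok 4": 72.8,
    "DeepSeek V4 Reasoning": 68.5,
    "Qwen 3.5 Max Thinking": 61.2,
}

# Reference points taken from this page.
GRAND_PRIZE_THRESHOLD = 85.0
HUMAN_AVERAGE = 66.0
HUMAN_TOP_DECILE = 92.0


def classify(score: float) -> str:
    """Place a score into a band relative to the baselines (hypothetical helper)."""
    if score >= HUMAN_TOP_DECILE:
        return "at or above human top decile"
    if score >= GRAND_PRIZE_THRESHOLD:
        return "at or above grand-prize threshold"
    if score >= HUMAN_AVERAGE:
        return "at or above human average"
    return "below human average"


bands = {name: classify(score) for name, score in ARC_AGI_2_SCORES.items()}

for name, band in bands.items():
    print(f"{name:32s} {ARC_AGI_2_SCORES[name]:5.1f}%  {band}")
```

On these figures, only the Confluence Lab stack clears the human top decile, and Qwen 3.5 Max Thinking is the only listed system below the human average.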

ARC-AGI-3 Initial Evaluation

Model                    | Score | Date
Gemini 3.1 Pro           | 0.37% | Mar 2026
GPT-5.5                  | 1.8%  | Apr 2026
Claude 4.7 Opus          | 2.1%  | Apr 2026
Confluence Lab early run | 4.5%  | May 2026
Human average            | ~71%  | Baseline

Adjacent Reasoning Benchmarks

Benchmark            | Leading Model (May 2026)      | Score | 2024 Baseline
FrontierMath         | GPT-5.5 with tools            | 53%   | ~2% (GPT-4o)
Humanity's Last Exam | GPT-5.5                       | ~38%  | ~9% (GPT-4o)
GPQA-Diamond         | Claude 4.7 Opus               | ~88%  | ~50% (GPT-4o)
SWE-Bench Verified   | Claude 4.7 Opus + Claude Code | ~82%  | ~13% (GPT-4o)
USAMO 2025           | GPT-5.5 with tools            | ~60%  | n/a
MATH                 | Multiple at saturation        | ~99%  | ~70% (GPT-4o)
MMLU                 | Multiple at saturation        | ~92%  | ~88% (GPT-4o)
MMLU-Pro             | GPT-5.5                       | ~84%  | ~70% (GPT-4o)
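
The gains in the table are easier to compare as arithmetic. The sketch below computes the percentage-point gain over the 2024 baseline and, as one possible normalisation, the share of the remaining headroom (the gap between the baseline and 100 percent) that has been closed. It uses the approximate figures from the table as given; `headroom_closed` is a hypothetical helper, and the headroom measure is an illustrative choice rather than a metric the benchmark maintainers publish.

```python
# (benchmark, leading score in May 2026, GPT-4o baseline in 2024), in percent.
# Values are the approximate figures from the table above. USAMO 2025 is
# omitted because the table lists no 2024 baseline for it.
ROWS = [
    ("FrontierMath", 53, 2),
    ("Humanity's Last Exam", 38, 9),
    ("GPQA-Diamond", 88, 50),
    ("SWE-Bench Verified", 82, 13),
    ("MATH", 99, 70),
    ("MMLU", 92, 88),
    ("MMLU-Pro", 84, 70),
]


def headroom_closed(current: float, baseline: float) -> float:
    """Fraction of the gap between the baseline and 100 percent that has been closed."""
    return (current - baseline) / (100 - baseline)


# Percentage-point gain and share of headroom closed, keyed by benchmark name.
gains = {name: current - baseline for name, current, baseline in ROWS}
closed = {name: headroom_closed(current, baseline) for name, current, baseline in ROWS}

# Print benchmarks from largest to smallest percentage-point gain.
for name in sorted(gains, key=gains.get, reverse=True):
    print(f"{name:22s} +{gains[name]:2d} pts  {closed[name]:.0%} of headroom closed")
```

The two views can disagree: MATH and Humanity's Last Exam both gained 29 points, but MATH started from a much higher baseline and has closed nearly all of its remaining headroom.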

What ARC-AGI Measures

ARC-AGI tests fluid intelligence on novel reasoning puzzles that cannot be solved by retrieval or memorisation. Each task consists of a small number of input-output examples; the model must induce the underlying transformation and apply it to a new input. The benchmark resists data contamination because every puzzle is novel, and it resists brute-force scaling because adding parameters alone yields diminishing returns. ARC-AGI-2 is harder than ARC-AGI-1 along three axes: more complex transformations, larger grid sizes, and adversarially selected puzzles that defeat common heuristics. ARC-AGI-3 (launched March 2026) adds compositional reasoning over multiple steps and demands generalisation across visually dissimilar instances of the same underlying rule.
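
The induce-then-apply structure described above can be illustrated with a toy example. This is a minimal sketch, assuming grids are represented as nested lists of integer colour codes and that the hidden rule comes from a tiny, hand-picked set of candidate transformations. The task, the `CANDIDATES` table, and the `induce` function are all invented for illustration; real ARC-AGI tasks require a far richer hypothesis space, which is what makes the benchmark hard.

```python
from typing import Callable, Optional

Grid = list[list[int]]  # rows of integer colour codes

# A tiny, hand-picked hypothesis space. These four transformations exist
# only to show the induce-then-apply loop; they are not an ARC solver.
CANDIDATES: dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: [row[:] for row in g],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def induce(train_pairs: list[tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first candidate consistent with every training pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in train_pairs):
            return name
    return None


# Made-up task: each output is the input mirrored left to right.
train_pairs = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]
test_input = [[5, 0], [0, 6], [7, 0]]

# Induce the rule from the examples, then apply it to the unseen input.
rule = induce(train_pairs)
prediction = CANDIDATES[rule](test_input) if rule else None
print(rule, prediction)
```

A fixed candidate list like this fails as soon as a puzzle uses a rule outside the list, which is the sense in which novel puzzles defeat memorised heuristics.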

Strategic Context

The pace of progress on ARC-AGI-2 in Q1 2026 was unexpected. The consensus among researchers polled in late 2025 was that the 85 percent grand-prize threshold would not be crossed in 2026. The progress reflects three trends: scaling of reasoning compute (test-time inference budget), training on synthetic ARC-style problems, and architectural improvements in attention over visual-grid representations. Whether ARC-AGI-3 falls to similar techniques over the next 12 to 18 months is the most-watched question in frontier evaluation.

Brand Visibility Implications

Frontier benchmark progress is heavily covered in technical and business AI journalism. Brands selling AI reasoning tooling, AI agents, AI for science, and AI evaluation services have a large AI-mediated discovery surface for queries such as "best reasoning model 2026", "GPT-5 vs Claude 4.7 reasoning", "AI math benchmark", and similar long-tail terms. The category moves fast enough that AI assistant recommendations frequently lag the underlying benchmark state, which creates an opportunity for first-mover brands that produce up-to-date comparison content.

Methodology

Benchmark scores are compiled from ARC Prize, Epoch AI, lab announcements, and peer-reviewed evaluation papers through 22 May 2026. Scores reflect publicly reported results; some lab-internal results may differ. The page is updated monthly, with quarterly deep-dive analyses.

How Presenc AI Helps

Presenc AI monitors brand-mention rates on reasoning-benchmark queries across ChatGPT, Claude, Gemini, and Perplexity. For brands selling AI reasoning tools, AI evaluation services, or AI for science, this gives operational visibility into how the journalism cycle around benchmark progress translates into AI-mediated discovery.

Frequently Asked Questions

What is ARC-AGI?
A reasoning benchmark created by François Chollet that tests fluid intelligence on novel grid-puzzle problems. Each task is a small number of input-output examples; the model must induce the underlying transformation. The benchmark resists data contamination because every puzzle is novel, and resists brute-force scaling because adding parameters alone yields diminishing returns.

What is the state of ARC-AGI in 2026?
GPT-5.5 hit 85 percent on ARC-AGI-2 in March 2026, crossing the grand-prize threshold. Confluence Lab pushed scores to 97.9 percent by April. ARC-AGI-3, launched in March 2026, is the new frontier ceiling; no model has crossed 5 percent as of May 2026.

How are frontier models performing on FrontierMath and Humanity's Last Exam?
FrontierMath: GPT-5.5 with tools at 53 percent in March 2026, up from approximately 2 percent for GPT-4o in 2024. Humanity's Last Exam: GPT-5.5 at approximately 38 percent in May 2026.

Does saturating ARC-AGI-2 mean AGI has been achieved?
No. ARC-AGI is one targeted measure of fluid reasoning on visual-grid problems. Saturation on a specific benchmark does not imply general intelligence; ARC-AGI-3 was designed in part because ARC-AGI-2 was approached too quickly. Multiple complementary benchmarks (FrontierMath, GPQA-Diamond, SWE-Bench Verified, Humanity's Last Exam) together provide a more complete picture.

Which model is best at reasoning?
It depends on the task. Among frontier models, GPT-5.5 leads on ARC-AGI-2 and FrontierMath. Claude 4.7 Opus leads on GPQA-Diamond and SWE-Bench Verified. Gemini 3.1 Pro is competitive on multimodal reasoning. The top-tier models cluster within 10 to 15 percentage points of each other on most benchmarks.