Has any AI cracked ARC-AGI?

GPT-5.5 hit 85 percent on ARC-AGI-2 in March 2026, crossing the grand-prize threshold. Confluence Lab pushed scores to 97.9 percent by April. ARC-AGI-3, launched in March 2026, is the new frontier ceiling; no model has crossed 5 percent as of May.

What about FrontierMath and Humanity's Last Exam?

FrontierMath: GPT-5.5 with tools at 53 percent in March 2026, up from approximately 2 percent for GPT-4o in 2024. Humanity's Last Exam: GPT-5.5 at approximately 38 percent in May 2026.

Does benchmark progress mean we have AGI?

No. ARC-AGI is one targeted measure of fluid reasoning on visual-grid problems. Saturation on a specific benchmark does not imply general intelligence; ARC-AGI-3 was designed in part because ARC-AGI-2 was approached too quickly. Multiple complementary benchmarks (FrontierMath, GPQA-Diamond, SWE-Bench Verified, Humanity's Last Exam) together provide a more complete picture.

Which model is the strongest reasoner in 2026?

It depends on the task. GPT-5.5 leads on ARC-AGI-2 and FrontierMath. Claude 4.7 Opus leads on GPQA-Diamond and SWE-Bench Verified. Gemini 3.1 Pro is competitive on multimodal reasoning. The top-tier models cluster within 10-15 percentage points of each other on most benchmarks.

ARC-AGI Frontier Benchmark Tracker 2026

What is ARC-AGI?

A reasoning benchmark created by François Chollet that tests fluid intelligence on novel grid-puzzle problems. Each task provides a small number of input-output examples; the model must induce the underlying transformation. The benchmark resists data contamination because every puzzle is novel, and resists brute-force scaling because raw parameter count yields diminishing returns.

Frontier models cracked ARC-AGI-2 in Q1 2026. GPT-5.5 hit 85 percent (the grand-prize threshold) in March, and Confluence Lab pushed scores to 97.9 percent by April. On ARC-AGI-3, launched in March 2026 as the new ceiling, Gemini 3.1 Pro scored 0.37 percent in the initial evaluation. This page consolidates the disclosed scores, the methodology behind the benchmark, and the broader reasoning-benchmark landscape (FrontierMath, Humanity's Last Exam, GPQA-Diamond, SWE-Bench Verified).

Key Findings

GPT-5.5 hit 85 percent on ARC-AGI-2 in March 2026, crossing the grand-prize threshold. GPT-5.4 Pro scored 83.3 percent in January; Gemini 3.1 Pro scored 77.1 percent in February.
Confluence Lab pushed ARC-AGI-2 scores to 97.9 percent by April 2026, a research result that effectively saturated the benchmark.
ARC-AGI-3 was launched in March 2026 as the new frontier benchmark. Gemini 3.1 Pro scored 0.37 percent in the initial evaluation; no frontier model has crossed 5 percent as of May.
FrontierMath, the closed-set advanced mathematics benchmark, saw the leading model (GPT-5.5 with mathematical-reasoning tools) hit 53 percent in March 2026, up from 25 percent in late 2025.
Humanity's Last Exam, the broad-knowledge frontier benchmark, has the leading frontier model at approximately 38 percent in May 2026.

ARC-AGI-2 Leaderboard (May 2026)

Model or System	Score	Compute Tier	Date
Confluence Lab research stack	97.9%	Unconstrained	Apr 2026
GPT-5.5	85.0%	High	Mar 2026
GPT-5.4 Pro	83.3%	High	Jan 2026
Claude 4.7 Opus	81.4%	High	Mar 2026
Gemini 3.1 Pro Deep Think	77.1%	High	Feb 2026
Grok 4	72.8%	High	Mar 2026
DeepSeek V4 Reasoning	68.5%	High	Apr 2026
Qwen 3.5 Max Thinking	61.2%	High	Apr 2026
Human average	66%	n/a	Baseline
Human top decile	92%	n/a	Baseline

ARC-AGI-3 Initial Evaluation

Model	Score	Date
Gemini 3.1 Pro	0.37%	Mar 2026
GPT-5.5	1.8%	Apr 2026
Claude 4.7 Opus	2.1%	Apr 2026
Confluence Lab early run	4.5%	May 2026
Human average	~71%	Baseline

Adjacent Reasoning Benchmarks

Benchmark	Leading Model (May 2026)	Score	2024 Baseline
FrontierMath	GPT-5.5 with tools	53%	~2% (GPT-4o)
Humanity's Last Exam	GPT-5.5	~38%	~9% (GPT-4o)
GPQA-Diamond	Claude 4.7 Opus	~88%	~50% (GPT-4o)
SWE-Bench Verified	Claude 4.7 Opus + Claude Code	~82%	~13% (GPT-4o)
USAMO 2025	GPT-5.5 with tools	~60%	n/a
MATH	Multiple at saturation	~99%	~70% (GPT-4o)
MMLU	Multiple at saturation	~92%	~88% (GPT-4o)
MMLU-Pro	GPT-5.5	~84%	~70% (GPT-4o)
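
One way to read the table above is to rank benchmarks by percentage-point gain over the 2024 GPT-4o baseline. The following is a minimal sketch, assuming the approximate figures in the table are taken at face value; the `rows` layout and the `point_gain` helper are illustrative names, not part of any benchmark tooling. USAMO 2025 is left out because the table lists no 2024 baseline for it.

```python
# Approximate scores copied from the "Adjacent Reasoning Benchmarks" table:
# (benchmark, leading score in May 2026, 2024 GPT-4o baseline), all in percent.
rows = [
    ("FrontierMath", 53, 2),
    ("Humanity's Last Exam", 38, 9),
    ("GPQA-Diamond", 88, 50),
    ("SWE-Bench Verified", 82, 13),
    ("MATH", 99, 70),
    ("MMLU", 92, 88),
    ("MMLU-Pro", 84, 70),
]


def point_gain(score: float, baseline: float) -> float:
    """Hypothetical helper: gain in percentage points, not relative change."""
    return score - baseline


# Sort benchmarks from largest to smallest gain.
gains = sorted(
    ((name, point_gain(score, baseline)) for name, score, baseline in rows),
    key=lambda item: item[1],
    reverse=True,
)

for name, gain in gains:
    print(f"{name}: +{gain} points")
```

On these figures, the largest movement is on SWE-Bench Verified and FrontierMath, while MMLU shows almost none, which is what saturation looks like in the numbers.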

What ARC-AGI Measures

ARC-AGI tests fluid intelligence on novel reasoning puzzles that cannot be solved by retrieval or memorisation. Each task consists of a small number of input-output examples; the model must induce the underlying transformation and apply it to a new input. The benchmark resists data contamination because every puzzle is novel, and it resists brute-force scaling because raw parameter count yields diminishing returns. ARC-AGI-2 is harder than ARC-AGI-1 along three axes: more complex transformations, larger grid sizes, and adversarially selected puzzles that defeat common heuristics. ARC-AGI-3 (launched March 2026) adds compositional reasoning over multiple steps and demands generalisation across visually dissimilar instances of the same underlying rule.
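
The task structure described above can be illustrated with a toy example. This is a minimal sketch, not the official ARC Prize data format or scoring harness: the grids, the two candidate rules, and the helper `fits_all_examples` are all hypothetical and chosen only to show the "induce from examples, then apply to a new input" loop.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]  # small integers stand in for cell colours


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def transpose(grid: Grid) -> Grid:
    """Candidate rule: swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def fits_all_examples(
    rule: Callable[[Grid], Grid], examples: List[Tuple[Grid, Grid]]
) -> bool:
    """A rule is consistent only if it reproduces every example output."""
    return all(rule(inp) == out for inp, out in examples)


# Toy task: two demonstration pairs plus one held-out test input.
train_examples = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]
test_input = [[5, 0], [0, 6]]

candidates: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
}

# Keep only the rules that explain all demonstrations, then apply one.
consistent = [
    name for name, rule in candidates.items()
    if fits_all_examples(rule, train_examples)
]
prediction = candidates[consistent[0]](test_input)
print(consistent, prediction)
```

Real tasks have no fixed menu of candidate rules to pick from, which is why retrieval or memorisation does not help: the solver has to construct the transformation from the demonstrations themselves.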

Strategic Context

The pace of progress on ARC-AGI-2 in Q1 2026 was unexpected. The consensus of researchers polled in late 2025 did not anticipate that the 85 percent grand-prize threshold would be crossed in 2026. The progress reflects three trends: scaling of reasoning compute (test-time inference budget), training on synthetic ARC-style problems, and architectural improvements in attention over visual-grid representations. Whether ARC-AGI-3 falls to similar techniques over the next 12 to 18 months is the most-watched question in frontier evaluation.

Brand Visibility Implications

Frontier benchmark progress is heavily covered in technical and business AI journalism. Brands selling AI reasoning tooling, AI agents, AI for science, and AI evaluation services face a large AI-mediated discovery surface for queries like "best reasoning model 2026", "GPT-5 vs Claude 4.7 reasoning", "AI math benchmark", and similar long-tail terms. The category moves fast enough that AI assistant recommendations frequently lag the underlying benchmark state, creating an opportunity for first-mover brands that produce up-to-date comparison content.

Methodology

Benchmark scores are compiled from ARC Prize, Epoch AI, lab announcements, and peer-reviewed evaluation papers through 22 May 2026. Scores reflect publicly reported results; some lab-internal benchmarks may differ. The page is updated monthly, with quarterly deep-dive analyses.

How Presenc AI Helps

Presenc AI monitors brand-mention rates on reasoning-benchmark queries across ChatGPT, Claude, Gemini, and Perplexity. For brands selling AI reasoning tools, AI evaluation services, or AI for science, this provides operational visibility into how the journalism cycle around benchmark progress translates into AI-mediated discovery.