Frontier models cracked ARC-AGI-2 in Q1 2026. GPT-5.5 hit 85 percent (the grand-prize threshold) in March, and Confluence Lab reached 97.9 percent by April. On ARC-AGI-3, launched in March 2026 as the new ceiling, Gemini 3.1 Pro scored 0.37 percent in the initial evaluation. This page consolidates the disclosed scores, the methodology behind the benchmark, and the broader reasoning-benchmark landscape (FrontierMath, Humanity's Last Exam, GPQA-Diamond, SWE-Bench Verified).
Key Findings
- GPT-5.5 hit 85 percent on ARC-AGI-2 in March 2026, crossing the grand-prize threshold. GPT-5.4 Pro scored 83.3 percent in January; Gemini 3.1 Pro scored 77.1 percent in February.
- Confluence Lab pushed ARC-AGI-2 scores to 97.9 percent by April 2026, a research result that effectively saturated the benchmark.
- ARC-AGI-3 launched in March 2026 as the new frontier benchmark. Gemini 3.1 Pro scored 0.37 percent in the initial evaluation; no frontier model had crossed 5 percent as of May 2026.
- FrontierMath, the closed-set advanced mathematics benchmark, saw the leading model (GPT-5.5 with mathematical-reasoning tools) hit 53 percent in March 2026, up from 25 percent in late 2025.
- On Humanity's Last Exam, the broad-knowledge frontier benchmark, the leading frontier model (GPT-5.5) scores approximately 38 percent as of May 2026.
ARC-AGI-2 Leaderboard (May 2026)
| Model or System | Score | Compute Tier | Date |
|---|---|---|---|
| Confluence Lab research stack | 97.9% | Unconstrained | Apr 2026 |
| GPT-5.5 | 85.0% | High | Mar 2026 |
| GPT-5.4 Pro | 83.3% | High | Jan 2026 |
| Claude 4.7 Opus | 81.4% | High | Mar 2026 |
| Gemini 3.1 Pro Deep Think | 77.1% | High | Feb 2026 |
| Grok 4 | 72.8% | High | Mar 2026 |
| DeepSeek V4 Reasoning | 68.5% | High | Apr 2026 |
| Qwen 3.5 Max Thinking | 61.2% | High | Apr 2026 |
| Human average | 66% | n/a | Baseline |
| Human top decile | 92% | n/a | Baseline |
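The leaderboard rows can be read against the two reference points this page uses, the 85 percent grand-prize threshold and the 66 percent human average. The following is a minimal sketch that hard-codes the scores from the table above; the function name `classify` and the bucket labels are illustrative choices, not part of any official ARC Prize tooling.

```python
# Scores copied from the ARC-AGI-2 leaderboard table above (percent).
GRAND_PRIZE_THRESHOLD = 85.0  # grand-prize threshold cited on this page
HUMAN_AVERAGE = 66.0          # human average baseline from the table

scores = {
    "Confluence Lab research stack": 97.9,
    "GPT-5.5": 85.0,
    "GPT-5.4 Pro": 83.3,
    "Claude 4.7 Opus": 81.4,
    "Gemini 3.1 Pro Deep Think": 77.1,
    "Grok 4": 72.8,
    "DeepSeek V4 Reasoning": 68.5,
    "Qwen 3.5 Max Thinking": 61.2,
}


def classify(score: float) -> str:
    """Bucket a score relative to the two reference points (hypothetical labels)."""
    if score >= GRAND_PRIZE_THRESHOLD:
        return "at or above grand-prize threshold"
    if score >= HUMAN_AVERAGE:
        return "above human average"
    return "below human average"


summary = {name: classify(score) for name, score in scores.items()}
at_threshold = [n for n, s in scores.items() if s >= GRAND_PRIZE_THRESHOLD]
below_human = [n for n, s in scores.items() if s < HUMAN_AVERAGE]

for name, bucket in summary.items():
    print(f"{name}: {scores[name]:.1f}% ({bucket})")
```

Under these figures, two systems sit at or above the threshold and one listed model remains below the human average.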
ARC-AGI-3 Initial Evaluation
| Model | Score | Date |
|---|---|---|
| Gemini 3.1 Pro | 0.37% | Mar 2026 |
| GPT-5.5 | 1.8% | Apr 2026 |
| Claude 4.7 Opus | 2.1% | Apr 2026 |
| Confluence Lab early run | 4.5% | May 2026 |
| Human average | ~71% | Baseline |
Adjacent Reasoning Benchmarks
| Benchmark | Leading Model (May 2026) | Score | 2024 Baseline |
|---|---|---|---|
| FrontierMath | GPT-5.5 with tools | 53% | ~2% (GPT-4o) |
| Humanity's Last Exam | GPT-5.5 | ~38% | ~9% (GPT-4o) |
| GPQA-Diamond | Claude 4.7 Opus | ~88% | ~50% (GPT-4o) |
| SWE-Bench Verified | Claude 4.7 Opus + Claude Code | ~82% | ~13% (GPT-4o) |
| USAMO 2025 | GPT-5.5 with tools | ~60% | n/a |
| MATH | Multiple at saturation | ~99% | ~70% (GPT-4o) |
| MMLU | Multiple at saturation | ~92% | ~88% (GPT-4o) |
| MMLU-Pro | GPT-5.5 | ~84% | ~70% (GPT-4o) |
What ARC-AGI Measures
ARC-AGI tests fluid intelligence on novel reasoning puzzles that cannot be solved by retrieval or memorisation. Each task consists of a small number of input-output examples; the model must induce the underlying transformation and apply it to a new input. The benchmark resists data contamination because every puzzle is novel, and it resists pure scaling because raw parameter count yields diminishing returns. ARC-AGI-2 is harder than ARC-AGI-1 along three axes: more complex transformations, larger grid sizes, and adversarially selected puzzles that defeat common heuristics. ARC-AGI-3 (launched March 2026) adds compositional reasoning over multiple steps and demands generalisation across visually dissimilar instances of the same underlying rule.
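To make the task shape concrete, here is a toy sketch of the "induce the transformation from a few examples, then apply it to a new input" loop. It is an illustration under simplifying assumptions: the grids, the `task` layout, and the four candidate transformations are hypothetical, and real ARC-AGI puzzles are far too varied to be solved by searching a fixed list like this.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # each cell holds a colour index


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_horizontal(g: Grid) -> Grid:
    # Mirror each row left-to-right.
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    # Reverse the order of the rows.
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    # Swap rows and columns.
    return [list(col) for col in zip(*g)]


# Hypothetical hypothesis space: a handful of hand-written grid transformations.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def induce_rule(train_pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate consistent with every training pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in train_pairs):
            return name
    return None


# A made-up task: two demonstration pairs plus one unseen test input.
task = {
    "train": [
        ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
        ([[3, 3, 0], [0, 4, 0]], [[0, 3, 3], [0, 4, 0]]),
    ],
    "test_input": [[5, 0, 0], [0, 6, 0]],
}

rule = induce_rule(task["train"])
prediction = CANDIDATES[rule](task["test_input"]) if rule else None
print(rule, prediction)
```

Here the only candidate consistent with both demonstration pairs is the horizontal flip, so the prediction mirrors the test input. The benchmark's difficulty comes from the fact that the relevant rule for a new puzzle is generally not in any pre-written candidate list.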
Strategic Context
The pace of progress on ARC-AGI-2 in Q1 2026 was unexpected. The consensus of researchers polled in late 2025 did not anticipate the 85 percent grand-prize threshold being crossed in 2026. The progress reflects three trends: scaling of reasoning compute (test-time inference budget), training on synthetic ARC-style problems, and architectural improvements in attention over visual-grid representations. Whether ARC-AGI-3 falls to similar techniques over the next 12 to 18 months is the most-watched question in frontier evaluation.
Brand Visibility Implications
Frontier benchmark progress is heavily covered in technical and business AI journalism. Brands selling AI reasoning tooling, AI agents, AI for science, and AI evaluation services have a large AI-mediated discovery surface for queries like "best reasoning model 2026", "GPT-5 vs Claude 4.7 reasoning", "AI math benchmark", and similar long-tail terms. The category moves fast enough that AI assistant recommendations frequently lag the underlying benchmark state, which creates an opportunity for first-mover brands that produce up-to-date comparison content.
Methodology
Benchmark scores are compiled from ARC Prize, Epoch AI, lab announcements, and peer-reviewed evaluation papers through 22 May 2026. Scores reflect publicly reported results; some lab-internal benchmarks may differ. The page is updated monthly, with quarterly deep-dive analyses.
How Presenc AI Helps
Presenc AI monitors brand-mention rates on reasoning-benchmark queries across ChatGPT, Claude, Gemini, and Perplexity. For brands selling AI reasoning tools, AI evaluation services, or AI for science, this provides operational visibility into how the journalism cycle around benchmark progress translates into AI-mediated discovery.