Agentic benchmarks measure how reliably models complete multi-step tasks autonomously across browser, OS, and terminal environments. This page snapshots a composite ranking across WebArena, OSWorld, AgentBench, and TerminalBench as of June 2026.
June 2026 Composite Ranking
| Rank | Model | WebArena | OSWorld | TerminalBench | AgentBench |
|---|---|---|---|---|---|
| 1 | GPT-5.6 Pro | ~62% | ~52% | ~85% | ~78% |
| 2 | Claude Mythos 5 | ~61% | ~51% | ~84% | ~80% |
| 3 | Claude Opus 4.7 | ~58% | ~48% | ~82% | ~76% |
| 4 | Gemini 3.2 Pro | ~55% | ~46% | ~78% | ~73% |
| 5 | GPT-5.6 | ~52% | ~44% | ~76% | ~70% |
| 6 | DeepSeek V4.1 Pro | ~48% | ~40% | ~72% | ~65% |
| 7 | Claude Sonnet 4.6 | ~46% | ~38% | ~70% | ~62% |
| 8 | Qwen 3.7 | ~43% | ~35% | ~66% | ~58% |
| 9 | GLM-6 | ~38% | ~30% | ~58% | ~52% |
| 10 | Llama 4.5 Maverick | ~32% | ~26% | ~52% | ~46% |
Key Takeaways
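- Agentic benchmarks remain the most discriminating frontier-model evaluations: the best and worst scores in the table differ by 26 to 34 percentage points on each benchmark, and spreads are roughly 2x wider than on MMLU-Pro.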
- GPT-5.6 Pro leads narrowly on browser-based agentic tasks; Claude Mythos 5 leads on long-horizon AgentBench.
- The top OSWorld score, around 52%, remains well below the ~85% human baseline for the same tasks.
- Open-weight DeepSeek V4.1 Pro trails the best closed-model score on each benchmark by 12 to 15 percentage points.
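
The percentage-point gaps behind the last two takeaways can be recomputed directly from the table. The sketch below is illustrative only: the names `CLOSED`, `OPEN_WEIGHT`, `HUMAN_OSWORLD`, and `gaps_to_best` are hypothetical, and the gap is measured against the best closed-model score on each benchmark, which is one reasonable reading of "top closed models".

```python
BENCHMARKS = ("WebArena", "OSWorld", "TerminalBench", "AgentBench")

# Approximate scores (percent) copied from the ranking table, in BENCHMARKS order.
CLOSED = {
    "GPT-5.6 Pro": (62, 52, 85, 78),
    "Claude Mythos 5": (61, 51, 84, 80),
}
OPEN_WEIGHT = (48, 40, 72, 65)  # DeepSeek V4.1 Pro
HUMAN_OSWORLD = 85              # human baseline quoted in the takeaways


def gaps_to_best(closed, open_scores):
    """Gap in percentage points between the best closed score and the open-weight score."""
    # zip(*...) transposes per-model tuples into per-benchmark columns.
    best = [max(column) for column in zip(*closed.values())]
    return {name: b - o for name, b, o in zip(BENCHMARKS, best, open_scores)}


def human_gap(closed, baseline):
    """Gap between the human OSWorld baseline and the best closed-model OSWorld score."""
    osworld_index = BENCHMARKS.index("OSWorld")
    return baseline - max(scores[osworld_index] for scores in closed.values())


if __name__ == "__main__":
    for name, gap in gaps_to_best(CLOSED, OPEN_WEIGHT).items():
        print(f"{name:<14} open-weight gap: {gap} points")
    print(f"OSWorld human gap: {human_gap(CLOSED, HUMAN_OSWORLD)} points")
```

Because the table values are approximate, the resulting gaps carry the same uncertainty and are best read as ranges rather than exact figures.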
Methodology
Scores are compiled from vendor disclosures and the public leaderboards for WebArena, OSWorld, AgentBench, and TerminalBench. Agentic benchmark results are highly sensitive to scaffolding choices, so the numbers should be treated as directional. The page is updated monthly.
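
The page does not state how the four scores are combined into the composite ranking. As a minimal sketch, assuming an unweighted mean of the four benchmark scores (an assumption, not a documented method), the following Python snippet reproduces the rank order shown in the table. `SCORES`, `composite`, and `rank_models` are illustrative names; the values are the approximate percentages from the table.

```python
# Approximate scores (percent) copied from the ranking table.
# Column order: WebArena, OSWorld, TerminalBench, AgentBench.
SCORES = {
    "GPT-5.6 Pro": (62, 52, 85, 78),
    "Claude Mythos 5": (61, 51, 84, 80),
    "Claude Opus 4.7": (58, 48, 82, 76),
    "Gemini 3.2 Pro": (55, 46, 78, 73),
    "GPT-5.6": (52, 44, 76, 70),
    "DeepSeek V4.1 Pro": (48, 40, 72, 65),
    "Claude Sonnet 4.6": (46, 38, 70, 62),
    "Qwen 3.7": (43, 35, 66, 58),
    "GLM-6": (38, 30, 58, 52),
    "Llama 4.5 Maverick": (32, 26, 52, 46),
}


def composite(scores):
    """Unweighted mean of the benchmark scores (assumed aggregation)."""
    return sum(scores) / len(scores)


def rank_models(table):
    """Return (model, composite) pairs sorted from best to worst."""
    pairs = ((model, composite(scores)) for model, scores in table.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for rank, (model, score) in enumerate(rank_models(SCORES), start=1):
        print(f"{rank:>2}  {model:<20} {score:.2f}")
```

Under this assumption the top two models are separated by only 0.25 composite points (69.25 vs 69.00), so a different weighting of the four benchmarks could plausibly swap ranks 1 and 2.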
How Presenc AI Helps
Presenc AI tracks brand visibility on the agentic models that increasingly run shopping research, vendor evaluation, and procurement workflows inside enterprise contexts.