Agentic Benchmark Leaderboard June 2026

Composite agentic-task leaderboard for June 2026 across WebArena, OSWorld, AgentBench, and TerminalBench. GPT-5.6 Pro, Claude Mythos 5, and Claude Opus 4.7 lead.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: June 2026

Agentic benchmarks measure how reliably models complete multi-step tasks autonomously across browser, OS, and terminal environments. This page snapshots a composite ranking across WebArena, OSWorld, AgentBench, and TerminalBench as of June 2026.

June 2026 Composite Ranking

Rank	Model	WebArena	OSWorld	TerminalBench	AgentBench
1	GPT-5.6 Pro	~62%	~52%	~85%	~78%
2	Claude Mythos 5	~61%	~51%	~84%	~80%
3	Claude Opus 4.7	~58%	~48%	~82%	~76%
4	Gemini 3.2 Pro	~55%	~46%	~78%	~73%
5	GPT-5.6	~52%	~44%	~76%	~70%
6	DeepSeek V4.1 Pro	~48%	~40%	~72%	~65%
7	Claude Sonnet 4.6	~46%	~38%	~70%	~62%
8	Qwen 3.7	~43%	~35%	~66%	~58%
9	GLM-6	~38%	~30%	~58%	~52%
10	Llama 4.5 Maverick	~32%	~26%	~52%	~46%
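
The page does not state how the four benchmark scores are combined into one rank. As a minimal sketch, assuming an unweighted mean of the four approximate scores (an assumption, not a documented method), the ordering in the table can be recomputed as follows. The names SCORES and composite are illustrative only.

```python
from statistics import mean

# Approximate scores copied from the table above, in percent.
# Column order: WebArena, OSWorld, TerminalBench, AgentBench.
SCORES = {
    "GPT-5.6 Pro": (62, 52, 85, 78),
    "Claude Mythos 5": (61, 51, 84, 80),
    "Claude Opus 4.7": (58, 48, 82, 76),
    "Gemini 3.2 Pro": (55, 46, 78, 73),
    "GPT-5.6": (52, 44, 76, 70),
    "DeepSeek V4.1 Pro": (48, 40, 72, 65),
    "Claude Sonnet 4.6": (46, 38, 70, 62),
    "Qwen 3.7": (43, 35, 66, 58),
    "GLM-6": (38, 30, 58, 52),
    "Llama 4.5 Maverick": (32, 26, 52, 46),
}


def composite(scores):
    """Return (model, unweighted mean) pairs sorted best first.

    Equal weighting is an assumption; the source does not specify weights.
    """
    ranked = sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True)
    return [(model, mean(values)) for model, values in ranked]


for rank, (model, score) in enumerate(composite(SCORES), start=1):
    print(f"{rank:>2}  {model:<20} {score:.2f}")
```

Under this assumption the top two rows differ by a quarter of a point (69.25 vs 69.00), so a different weighting could swap them.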

Key Takeaways

Agentic benchmarks remain the most discriminating frontier-model evaluations; score spreads between models are about 2x wider than on MMLU-Pro.
GPT-5.6 Pro leads narrowly on browser-based agentic tasks (WebArena, ~62% vs ~61%); Claude Mythos 5 leads on long-horizon AgentBench (~80% vs ~78%).
Top model OSWorld scores around 52% remain well below the ~85% human baseline for the same tasks.
Open-weight DeepSeek V4.1 Pro sits within 12 to 15 percentage points of the top closed-model score on each benchmark.
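
The gap figures in these takeaways can be recomputed from the ranking table. The sketch below is illustrative only: it copies the rows for the two closed models that hold the per-benchmark maxima, plus DeepSeek V4.1 Pro, and subtracts. The helper name gaps_to_best is hypothetical, and all inputs are the approximate values from the table.

```python
BENCHMARKS = ("WebArena", "OSWorld", "TerminalBench", "AgentBench")

# Rows copied from the ranking table (approximate percentages).
CLOSED_LEADERS = {
    "GPT-5.6 Pro": (62, 52, 85, 78),
    "Claude Mythos 5": (61, 51, 84, 80),
}
DEEPSEEK_V41_PRO = (48, 40, 72, 65)
OSWORLD_HUMAN_BASELINE = 85  # approximate human baseline quoted above


def gaps_to_best(leaders, challenger):
    """Per-benchmark gap, in percentage points, to the best leader score."""
    best = [max(column) for column in zip(*leaders.values())]
    return {name: b - c for name, b, c in zip(BENCHMARKS, best, challenger)}


GAPS = gaps_to_best(CLOSED_LEADERS, DEEPSEEK_V41_PRO)
print(GAPS)

# Gap between the human baseline and the best model score on OSWorld.
HUMAN_GAP = OSWORLD_HUMAN_BASELINE - max(v[1] for v in CLOSED_LEADERS.values())
print(HUMAN_GAP)
```

The smallest gap falls on OSWorld and the largest on AgentBench, while the best OSWorld model still trails the human baseline by roughly 33 points.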

Methodology

Scores are compiled from vendor disclosures and the public leaderboards for WebArena, OSWorld, AgentBench, and TerminalBench. Because agentic benchmark evaluation is highly sensitive to scaffolding choices, all figures are approximate (hence the ~ prefix) and should be treated as directional. The page is updated monthly.

How Presenc AI Helps

Presenc AI tracks brand visibility on the agentic models that increasingly run shopping research, vendor evaluation, and procurement workflows inside enterprise contexts.

Frequently Asked Questions

The main agentic benchmarks are WebArena (browser tasks), OSWorld (desktop OS tasks), AgentBench (multi-domain), and TerminalBench (terminal tasks), along with emerging variants such as AndroidWorld for mobile and AppWorld for app workflows.

GPT-5.6 Pro from OpenAI is narrowly ahead overall, with Claude Mythos 5 leading on long-horizon AgentBench specifically.

Top model OSWorld scores around 52% remain well below the ~85% human baseline. Browser tasks have closed more of the gap. Long-horizon multi-app workflows remain the hardest open problem.

Scaffolding choices (planner architecture, memory, retry policies) materially change results. A single model can score 10 to 20 percentage points differently across credible scaffolding choices.
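
To make the scaffolding caveat concrete, the sketch below lists, for each benchmark, the models whose table score is within 10 points of the best score, using the lower end of the 10 to 20 point range quoted above. This is an illustrative reading, not a statistical test: the threshold, the strict less-than comparison, and the helper name within_swing are all assumptions.

```python
BENCHMARKS = ("WebArena", "OSWorld", "TerminalBench", "AgentBench")
SCAFFOLD_SWING = 10  # assumed threshold: low end of the quoted 10-20 point range

# Approximate scores copied from the ranking table, in BENCHMARKS order.
SCORES = {
    "GPT-5.6 Pro": (62, 52, 85, 78),
    "Claude Mythos 5": (61, 51, 84, 80),
    "Claude Opus 4.7": (58, 48, 82, 76),
    "Gemini 3.2 Pro": (55, 46, 78, 73),
    "GPT-5.6": (52, 44, 76, 70),
    "DeepSeek V4.1 Pro": (48, 40, 72, 65),
    "Claude Sonnet 4.6": (46, 38, 70, 62),
    "Qwen 3.7": (43, 35, 66, 58),
    "GLM-6": (38, 30, 58, 52),
    "Llama 4.5 Maverick": (32, 26, 52, 46),
}


def within_swing(scores, swing):
    """Map each benchmark to the models scoring less than `swing` below the best."""
    result = {}
    for i, name in enumerate(BENCHMARKS):
        best = max(values[i] for values in scores.values())
        result[name] = [m for m, values in scores.items() if best - values[i] < swing]
    return result


CLOSE = within_swing(SCORES, SCAFFOLD_SWING)
for name, models in CLOSE.items():
    print(f"{name}: {len(models)} models within {SCAFFOLD_SWING} points of the top")
```

On every benchmark at least four models fall inside that band, which is one way to read the advice to treat the ranking as directional.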
