What "Agent Capability" Actually Measures in 2026
The phrase "AI agent" covers a wide spectrum: simple tool-using chatbots, autonomous code-fixing systems, multi-step browsing agents, and sales SDR systems. Capability is not a single number; it is a profile across reasoning, tool use, planning, error recovery, and grounding. This page consolidates published benchmark results across the dimensions that actually predict production performance.
Key Findings
- On SWE-Bench Verified, frontier coding agents in May 2026 cluster around 70-78 percent task completion, up from 13 percent in early 2024 and 49 percent in early 2025; this is the steepest capability climb of any agent category.
- On GAIA (general AI assistant benchmark), top agents reach 78-82 percent on level 1, 60-68 percent on level 2, 35-45 percent on level 3; the level-3 gap reveals where current agents still fail.
- On BFCL v3 (Berkeley Function-Calling Leaderboard), function-calling accuracy at 5+ tools is 85-92 percent for frontier models, dropping to 65-78 percent at 20+ tools, suggesting that tool orchestration is a real production bottleneck.
- WebArena and VisualWebArena task-completion rates remain in the 35-55 percent range for browsing agents, materially below code agents. The gap stems from execution-environment friction (slow page loads, dynamic UI, anti-bot measures), not reasoning.
- End-to-end agent reliability (rate of task completion across 100 attempts on the same task) is 60-75 percent for production agents, below human baselines but rising.
SWE-Bench Verified Leaderboard (Coding Agents, May 2026 snapshot)
| Agent | Underlying Model | SWE-Bench Verified % |
|---|---|---|
| Claude Code (Opus 4.7) | Anthropic Claude Opus 4.7 | ~76-78% |
| OpenAI Codex agent (GPT-5 Pro) | OpenAI GPT-5 Pro | ~74-76% |
| Devin (Cognition AI) | Multi-model orchestration | ~52-58% |
| Cursor Agent (Sonnet 4.6) | Anthropic Claude Sonnet 4.6 | ~63-67% |
| Aider (Sonnet 4.6) | Anthropic Claude Sonnet 4.6 | ~58-63% |
| Cline (open-weight backed) | Various open-weight | ~38-45% |
| Open-source agent + Llama 4 70B | Llama 4 70B | ~25-32% |
Figures from swebench.com leaderboard plus vendor-published evaluations. Re-evaluations and contamination concerns make absolute numbers approximate; relative ranking is more reliable.
GAIA Benchmark (General Assistant Tasks, May 2026 snapshot)
| Agent | Level 1 % | Level 2 % | Level 3 % | Overall % |
|---|---|---|---|---|
| Top frontier agent | 78-82 | 60-68 | 35-45 | ~62-68 |
| Mid-tier production agent | 65-72 | 45-55 | 20-30 | ~48-55 |
| Open-source agent (Llama 4) | 50-60 | 30-40 | 10-18 | ~32-40 |
| Human baseline | ~92 | ~92 | ~92 | ~92 |
Function-Calling Accuracy (BFCL v3, Tool-Use)
| Model / Agent | Single-tool | 5 tools | 20+ tools |
|---|---|---|---|
| Claude Opus 4.7 | 96% | 91% | 76% |
| GPT-5 Pro | 95% | 90% | 74% |
| Gemini 2.5 Pro | 93% | 87% | 69% |
| Qwen 3 32B | 89% | 82% | 58% |
| Llama 4 70B | 87% | 79% | 54% |
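Function-calling accuracy of the kind reported above is, at its core, exact-match scoring over structured tool calls: the predicted function name and arguments must match a gold-standard call. The sketch below shows that idea in miniature; the call representation (a `name` plus an `args` dict) and the example tools are simplifying assumptions, not the leaderboard's actual evaluation schema.

```python
# Minimal sketch of function-call scoring in the BFCL style.
# The call format (name + args dict) and the tools below are hypothetical
# simplifications, not the leaderboard's real schema.

def call_matches(predicted: dict, gold: dict) -> bool:
    """A call counts as correct only if the tool name and every argument match."""
    return (predicted.get("name") == gold.get("name")
            and predicted.get("args") == gold.get("args"))

def accuracy(predictions: list, golds: list) -> float:
    """Fraction of predicted calls that exactly match the gold calls."""
    correct = sum(call_matches(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

golds = [
    {"name": "get_weather", "args": {"city": "Oslo"}},
    {"name": "convert_units", "args": {"value": 5, "frm": "mi", "to": "km"}},
]
predictions = [
    {"name": "get_weather", "args": {"city": "Oslo"}},                        # correct
    {"name": "convert_units", "args": {"value": 5, "frm": "km", "to": "mi"}}, # swapped units
]

print(f"accuracy: {accuracy(predictions, golds):.2f}")  # accuracy: 0.50
```

One near-miss (swapped argument values) halves the score, which is why accuracy degrades so sharply as the tool count grows: with 20+ candidate tools there are simply more ways to pick the wrong function or misbind an argument.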
Browsing Agent Performance (WebArena 2026)
| Browsing Agent | Task completion % | Median time/task |
|---|---|---|
| Claude Computer Use 2 (Opus 4.7) | ~52-56% | ~3.4 min |
| OpenAI Operator (GPT-5) | ~48-53% | ~3.0 min |
| Browser Use + Llama 4 | ~32-38% | ~5.2 min |
| Atlas agentic mode | ~50-54% | ~4.2 min |
| Comet agentic mode | ~46-50% | ~6.1 min |
End-to-End Reliability (100-attempt repeated-task success)
Capability on a single attempt overstates production performance. Across 100 repeated attempts on the same task, reliability decays:
- Coding agents on SWE-Bench-style tasks: 60-72 percent completion across repeats (frontier)
- Browsing agents on WebArena tasks: 38-48 percent across repeats
- Tool-use agents on BFCL composite tasks: 75-83 percent across repeats
The reliability gap (best-case vs. expected) is 8-22 percentage points depending on category; the implication is that production agent SLAs cannot be built on best-case benchmark numbers.
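The best-case-vs.-expected distinction can be made concrete with a few lines of arithmetic: take the headline single-attempt benchmark score, measure the empirical completion rate across repeated runs of the same task, and the difference is the reliability gap. The numbers below are illustrative stand-ins, not figures from the tables above.

```python
# Illustrative sketch of the reliability gap: headline benchmark score
# (best-case single-attempt figure) vs. expected reliability across repeats.
# All numbers here are hypothetical, not drawn from the benchmarks above.

def reliability(outcomes: list) -> float:
    """Fraction of attempts that completed the task."""
    return sum(outcomes) / len(outcomes)

headline_score = 0.76            # best-case benchmark figure (SWE-Bench-style)
attempts = [1] * 66 + [0] * 34   # 100 repeated runs of the same task (1 = success)

expected = reliability(attempts) # what a production SLA actually experiences
gap = headline_score - expected

print(f"expected reliability: {expected:.2f}")  # 0.66
print(f"reliability gap: {gap:.2f}")            # 0.10, inside the 8-22 point range
```

An SLA quoted off the headline score would overpromise by the gap on every repeated execution of the same workload, which is why repeated-attempt reliability is the number to contract against.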
Capability Tier Framework (Adapted from Lenny's Newsletter)
A capability framework that aligns with Lenny Rachitsky's "Not all AI agents are created equal" taxonomy, mapped to measurable benchmarks:
| Tier | Description | Benchmark Surrogate |
|---|---|---|
| Tier 1: Chat with tools | Stateless, single tool call, no memory | BFCL single-tool 95%+ |
| Tier 2: Workflow automation | Pre-defined steps, deterministic flow | Custom workflow tests, not agent benchmarks |
| Tier 3: Tool-orchestrating agent | Dynamic multi-step, 5-20 tools | BFCL 5-20 tools, GAIA L1 |
| Tier 4: Autonomous task agent | Long-horizon, error recovery, planning | SWE-Bench Verified, GAIA L2-L3 |
| Tier 5: Multi-agent system | Coordinated agents, role specialisation | Custom multi-agent tests, no standard benchmark yet |
Brand Visibility Implications
Three implications. First, agents at Tier 3+ make brand recommendations during multi-step tasks (research, comparison shopping, vendor selection); benchmark performance directly affects how often your brand surfaces correctly. Second, function-calling accuracy at 20+ tools (where production agents operate) is the practical bottleneck for being correctly recommended; agents that drop tools cannot recommend you, regardless of training-data presence. Third, browsing agents at 35-55 percent task completion still shape brand visibility every time they succeed; presence in their context windows during their successful attempts matters. See how AI agents choose brands for the brand-visibility-specific analysis.
Methodology
Benchmark figures are from the SWE-Bench Verified leaderboard, the GAIA Hugging Face leaderboard, the Berkeley Function-Calling Leaderboard, and WebArena. Vendor-published numbers are cross-checked against independent re-evaluations where available. Benchmarks have known contamination issues; absolute numbers should be treated with appropriate caution, and relative rankings are more reliable. Updated quarterly as new model releases ship.
How Presenc AI Helps
Presenc AI tracks brand-mention rates during agent task execution across the major agent platforms (Claude Code, Operator, Atlas, Comet, Devin, Cursor), surfacing how brand presence varies between agent capability tiers and between successful and failed task attempts. For brand teams operating in agent-mediated buyer journeys, this is the operational connection between agent benchmarks and brand exposure.