This page snapshots the capability gap between the top open-weight model and the top closed frontier model on each major benchmark, as of June 2026.
Benchmark Gap Comparison (June 2026)
| Benchmark | Top Closed | Top Open-Weight | Gap |
|---|---|---|---|
| SWE-bench Verified | Claude Mythos 5 ~78% | DeepSeek V4.1 Pro ~69% | ~9 pts |
| HumanEval | Claude Mythos 5 ~98.8% | DeepSeek V4.1 Pro ~97.8% | ~1 pt |
| GPQA Diamond | Claude Mythos 5 ~88% | DeepSeek V4.1 Pro ~75% | ~13 pts |
| MMLU-Pro | GPT-5.6 Pro ~84% | DeepSeek V4.1 Pro ~74% | ~10 pts |
| Chatbot Arena Elo (Hard) | GPT-5.6 Pro ~1465 | DeepSeek V4.1 Pro ~1410 | ~55 Elo |
| WebArena | GPT-5.6 Pro ~62% | DeepSeek V4.1 Pro ~48% | ~14 pts |
| OSWorld | GPT-5.6 Pro ~52% | DeepSeek V4.1 Pro ~40% | ~12 pts |
| TerminalBench | GPT-5.6 Pro ~85% | DeepSeek V4.1 Pro ~72% | ~13 pts |
| Context window (max, tokens) | Gemini 3.2 Pro 2M | Llama 4.5 Scout 10M | Open leads 5x |
| Input pricing (USD per 1M tokens) | GPT-5.6 ~$5 | DeepSeek V4.1 Flash ~$0.12 | Open ~42x cheaper |
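
The Gap column can be reproduced from the two score columns. The following is a minimal sketch, assuming the gap is simply the closed score minus the open-weight score and that the context and price entries are plain ratios; the names `ROWS`, `gap`, `GAPS`, `CONTEXT_RATIO`, and `PRICE_RATIO` are illustrative and not part of any published tooling.

```python
# Approximate scores transcribed from the table above: (benchmark, top closed, top open-weight).
ROWS = [
    ("SWE-bench Verified", 78.0, 69.0),
    ("HumanEval", 98.8, 97.8),
    ("GPQA Diamond", 88.0, 75.0),
    ("MMLU-Pro", 84.0, 74.0),
    ("Chatbot Arena Elo (Hard)", 1465.0, 1410.0),
    ("WebArena", 62.0, 48.0),
    ("OSWorld", 52.0, 40.0),
    ("TerminalBench", 85.0, 72.0),
]


def gap(closed: float, open_weight: float) -> float:
    """Assumed definition: closed score minus open-weight score, rounded to one decimal."""
    return round(closed - open_weight, 1)


# Gap per benchmark (percentage points, or Elo for the Chatbot Arena row).
GAPS = {name: gap(closed, open_weight) for name, closed, open_weight in ROWS}

# Ratio rows: open-weight context window over closed, and closed input price over open-weight.
CONTEXT_RATIO = 10_000_000 / 2_000_000
PRICE_RATIO = 5.00 / 0.12

if __name__ == "__main__":
    for name, value in GAPS.items():
        print(f"{name}: {value}")
    print(f"Context ratio: {CONTEXT_RATIO:.0f}x")
    print(f"Price ratio: ~{PRICE_RATIO:.0f}x")
```

Because the published scores are approximate, the computed gaps carry the same uncertainty as the inputs.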
Key Takeaways
- HumanEval gap has effectively closed (~1 point) because both ends of the comparison sit in the saturation regime.
- SWE-bench Verified gap of ~9 points is the most consequential remaining frontier-coding differentiator.
- Agentic benchmarks (WebArena, OSWorld, TerminalBench) show the most persistent closed-model advantage at ~12-14 points.
- Open-weight wins on context window (Llama 4.5 Scout's 10M tokens is 5x the 2M of Gemini 3.2 Pro) and on input pricing (DeepSeek V4.1 Flash is ~42x cheaper than GPT-5.6).
- The trade-off pattern: open-weight matches closed models on saturated text-generation tasks such as HumanEval, while closed models lead most clearly on agentic-task reliability.
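
The agentic range quoted in the takeaways can be checked by grouping the table's gaps. This is a rough sketch, assuming the gap values from the table and the agentic grouping named in the list (WebArena, OSWorld, TerminalBench); `GAPS_PTS`, `AGENTIC`, and `gap_range` are hypothetical names used only for illustration.

```python
# Gaps in percentage points, transcribed from the table (the Elo row is left out
# because it is not measured in points).
GAPS_PTS = {
    "SWE-bench Verified": 9,
    "HumanEval": 1,
    "GPQA Diamond": 13,
    "MMLU-Pro": 10,
    "WebArena": 14,
    "OSWorld": 12,
    "TerminalBench": 13,
}

# Grouping taken from the takeaways: these three are treated as agentic benchmarks.
AGENTIC = {"WebArena", "OSWorld", "TerminalBench"}


def gap_range(names):
    """Return (smallest, largest) gap among the named benchmarks."""
    values = [GAPS_PTS[name] for name in names]
    return min(values), max(values)


agentic_range = gap_range(AGENTIC)
non_agentic_range = gap_range(set(GAPS_PTS) - AGENTIC)

if __name__ == "__main__":
    print("Agentic gap range:", agentic_range)
    print("Non-agentic gap range:", non_agentic_range)
```

Grouping this way makes it easy to re-check the quoted range whenever the monthly scores change.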
What This Means for Brand Visibility
Brands that rely on agentic-task-driven discovery surfaces (browser agents, OS agents, terminal copilots) face higher closed-model exposure. Brands appearing primarily in chat-style retrieval queries face roughly equivalent visibility patterns across open-weight and closed models, with cost dynamics increasingly favoring open-weight deployment in developer tools and self-hosted enterprise contexts.
Methodology
Benchmark scores are taken from vendor disclosures and public leaderboards as of June 2026. The top open-weight model is selected per benchmark; DeepSeek V4.1 Pro appears in multiple rows, reflecting its current open-weight leadership. The page is updated monthly.
How Presenc AI Helps
Presenc AI tracks brand visibility across both closed and open-weight models so brand teams see the full discovery surface where their brand competes.