Open-Weight vs Closed Frontier Snapshot June 2026

Capability parity gap between top open-weight and top closed-frontier LLMs in June 2026. Gaps range from about 1 point (HumanEval) to about 14 points (WebArena); agentic tasks remain the persistent closed-model advantage.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: June 2026

This page snapshots the capability parity gap between the top open-weight model and the top closed-frontier model on each major benchmark as of June 2026.

Benchmark Gap Comparison (June 2026)

| Benchmark | Top Closed | Top Open-Weight | Gap |
| --- | --- | --- | --- |
| SWE-bench Verified | Claude Mythos 5 ~78% | DeepSeek V4.1 Pro ~69% | ~9 pts |
| HumanEval | Claude Mythos 5 ~98.8% | DeepSeek V4.1 Pro ~97.8% | ~1 pt |
| GPQA Diamond | Claude Mythos 5 ~88% | DeepSeek V4.1 Pro ~75% | ~13 pts |
| MMLU-Pro | GPT-5.6 Pro ~84% | DeepSeek V4.1 Pro ~74% | ~10 pts |
| Chatbot Arena Elo (Hard) | GPT-5.6 Pro ~1465 | DeepSeek V4.1 Pro ~1410 | ~55 Elo |
| WebArena | GPT-5.6 Pro ~62% | DeepSeek V4.1 Pro ~48% | ~14 pts |
| OSWorld | GPT-5.6 Pro ~52% | DeepSeek V4.1 Pro ~40% | ~12 pts |
| TerminalBench | GPT-5.6 Pro ~85% | DeepSeek V4.1 Pro ~72% | ~13 pts |
| Context window (max) | Gemini 3.2 Pro 2M tokens | Llama 4.5 Scout 10M tokens | Open leads 5x |
| Input pricing (per 1M tokens) | GPT-5.6 ~$5 | DeepSeek V4.1 Flash ~$0.12 | Open ~42x cheaper |
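
The Gap column is simple arithmetic: closed score minus open-weight score for the percentage benchmarks, and a ratio for context window and pricing. The following is a minimal Python sketch of that arithmetic using the approximate figures from the table; the names `SCORES`, `point_gaps`, and `ratio` are illustrative and do not come from the source.

```python
# Approximate scores copied from the table (percent accuracy).
# Tuple order is (top closed, top open-weight); this layout is an assumption.
SCORES = {
    "SWE-bench Verified": (78.0, 69.0),
    "HumanEval": (98.8, 97.8),
    "GPQA Diamond": (88.0, 75.0),
    "MMLU-Pro": (84.0, 74.0),
    "WebArena": (62.0, 48.0),
    "OSWorld": (52.0, 40.0),
    "TerminalBench": (85.0, 72.0),
}


def point_gaps(scores):
    """Return the closed-minus-open gap in percentage points per benchmark."""
    return {
        name: round(closed - open_weight, 1)
        for name, (closed, open_weight) in scores.items()
    }


def ratio(larger, smaller):
    """Return how many times larger the first value is than the second."""
    return larger / smaller


gaps = point_gaps(SCORES)

# Benchmarks where the gap is strictly below 10 points.
single_digit = [name for name, gap in gaps.items() if gap < 10]

# Context window: Llama 4.5 Scout (10M tokens) vs Gemini 3.2 Pro (2M tokens).
context_ratio = ratio(10_000_000, 2_000_000)

# Input pricing in USD per 1M tokens: GPT-5.6 vs DeepSeek V4.1 Flash.
price_ratio = ratio(5.00, 0.12)

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name}: {gap} pts")
    print(f"context ratio: {context_ratio:.0f}x, price ratio: {price_ratio:.1f}x")
```

Run as written, this reproduces the table's Gap column: only SWE-bench Verified and HumanEval fall under 10 points, and the pricing ratio of 5 / 0.12 rounds to 42.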

Key Takeaways

  • The HumanEval gap has effectively closed (~1 point) because both models score near the benchmark's ceiling (~98%), leaving little room to differentiate.
  • SWE-bench Verified gap of ~9 points is the most consequential remaining frontier-coding differentiator.
  • Agentic benchmarks (WebArena, OSWorld, TerminalBench) show the most persistent closed-model advantage at ~12-14 points.
  • Open-weight wins on context window (Llama 4.5 Scout's 10M tokens is 5x the top closed model's 2M) and pricing (~42x cheaper at the cheapest tier).
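  • The trade-off pattern is clear: open-weight is near parity on saturated code-generation benchmarks such as HumanEval, while closed models keep double-digit leads on knowledge and reasoning benchmarks (GPQA Diamond, MMLU-Pro) and on agentic-task reliability.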
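
The ~55-point Chatbot Arena gap in the table is on a different scale from the percentage benchmarks, so it helps to translate it into an expected head-to-head preference rate. The sketch below assumes the standard Elo logistic curve with a 400-point scale; the leaderboard's actual rating method may differ, and `elo_expected_score` is an illustrative name rather than anything from the source.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo curve.

    A draw counts as half a win. The 400-point scale is the conventional
    Elo constant and is an assumption about how the ratings were produced.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Approximate ratings from the table: GPT-5.6 Pro vs DeepSeek V4.1 Pro.
arena_gap = 1465 - 1410
preference_rate = elo_expected_score(arena_gap)

if __name__ == "__main__":
    print(f"gap: {arena_gap} Elo, expected preference: {preference_rate:.3f}")
```

Under that assumption, a ~55-point gap means the higher-rated model would be preferred in roughly 58% of pairwise comparisons, which is a noticeable but not overwhelming edge.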

What This Means for Brand Visibility

Brands that rely on agentic-task-driven discovery surfaces (browser agents, OS agents, terminal copilots) face higher closed-model exposure. Brands appearing primarily in chat-style retrieval queries face roughly equivalent visibility patterns across open-weight and closed models, with cost dynamics increasingly favoring open-weight deployment in developer tools and self-hosted enterprise contexts.

Methodology

Benchmark scores are drawn from vendor disclosures and public leaderboards as of June 2026. The top closed and top open-weight models are selected per benchmark, and each gap is the closed score minus the open-weight score. The same DeepSeek V4.1 Pro entry appears across multiple categories, reflecting its current open-weight leadership. Updated monthly.
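
The per-benchmark selection described above can be sketched as a small grouping step: for each benchmark, keep the highest-scoring closed entry and the highest-scoring open-weight entry. This is a minimal illustration, not the page's actual pipeline; the `Entry` type, the `top_per_benchmark` function, and the "Hypothetical Open Model" row are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    benchmark: str
    model: str
    open_weight: bool
    score: float


def top_per_benchmark(entries):
    """Map each benchmark to its best 'closed' and best 'open' entry by score."""
    best = {}
    for entry in entries:
        group = "open" if entry.open_weight else "closed"
        slot = best.setdefault(entry.benchmark, {})
        # Keep the entry only if it beats the current leader in its group.
        if group not in slot or entry.score > slot[group].score:
            slot[group] = entry
    return best


entries = [
    Entry("SWE-bench Verified", "Claude Mythos 5", False, 78.0),
    Entry("SWE-bench Verified", "DeepSeek V4.1 Pro", True, 69.0),
    # Placeholder row to show the selection at work; not a model from this page.
    Entry("SWE-bench Verified", "Hypothetical Open Model", True, 60.0),
]

best = top_per_benchmark(entries)
swe = best["SWE-bench Verified"]
gap = swe["closed"].score - swe["open"].score

if __name__ == "__main__":
    print(swe["closed"].model, swe["open"].model, gap)
```

Selecting per benchmark rather than fixing one model per side is why different closed models (Claude Mythos 5, GPT-5.6 Pro) appear in different rows of the comparison.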

How Presenc AI Helps

Presenc AI tracks brand visibility across both closed and open-weight models so brand teams see the full discovery surface where their brand competes.

Frequently Asked Questions

Open-weight models are within ~9 points of closed models on SWE-bench Verified and ~1 point on HumanEval. The gap is wider (~12-14 points) on agentic benchmarks, where closed models maintain a persistent advantage.
DeepSeek V4.1 Pro consistently leads or ties for the top open-weight slot across SWE-bench Verified, GPQA Diamond, MMLU-Pro, Chatbot Arena Elo (Hard), and agentic benchmarks.
Open-weight models lead on context window (Llama 4.5 Scout 10M tokens vs Gemini 3.2 Pro 2M) and pricing (DeepSeek V4.1 Flash ~$0.12 per million input tokens vs GPT-5.6 ~$5).
On chat-style benchmarks, open-weight models will likely reach parity, given how far the gap has already narrowed. On agentic tasks, the closed-model advantage may persist longer because scaffolding and tool-use tuning are harder to replicate from weights alone.
