How close is open-weight to closed-frontier in June 2026?

Within ~9 points on SWE-bench Verified and ~1 point on HumanEval. The gap is wider (~12-14 points) on agentic benchmarks where closed models maintain a persistent advantage.

Which open-weight model leads in June 2026?

DeepSeek V4.1 Pro consistently leads or ties for the top open-weight slot across SWE-bench Verified, GPQA Diamond, MMLU-Pro, Arena Hard, and agentic benchmarks.

On what dimensions does open-weight beat closed?

Context window (Llama 4.5 Scout at 10M tokens vs Gemini 3.2 Pro at 2M) and pricing (DeepSeek V4.1 Flash at ~$0.12 per million input tokens vs GPT-5.6 at ~$5).

Will the gap close further in 2026?

On chat-style benchmarks, likely yes; the trajectory is clear. On agentic tasks, the closed-model advantage may persist longer because scaffolding and tool-use tuning are harder to replicate from weights alone.

Open-Weight vs Closed Frontier Snapshot June 2026

This page snapshots the capability parity gap between the top open-weight model and the top closed-frontier model on each major benchmark as of June 2026.

Benchmark Gap Comparison (June 2026)

Benchmark	Top Closed	Top Open-Weight	Gap
SWE-bench Verified	Claude Mythos 5 ~78%	DeepSeek V4.1 Pro ~69%	~9 pts
HumanEval	Claude Mythos 5 ~98.8%	DeepSeek V4.1 Pro ~97.8%	~1 pt
GPQA Diamond	Claude Mythos 5 ~88%	DeepSeek V4.1 Pro ~75%	~13 pts
MMLU-Pro	GPT-5.6 Pro ~84%	DeepSeek V4.1 Pro ~74%	~10 pts
Chatbot Arena Elo (Hard)	GPT-5.6 Pro ~1465	DeepSeek V4.1 Pro ~1410	~55 Elo
WebArena	GPT-5.6 Pro ~62%	DeepSeek V4.1 Pro ~48%	~14 pts
OSWorld	GPT-5.6 Pro ~52%	DeepSeek V4.1 Pro ~40%	~12 pts
TerminalBench	GPT-5.6 Pro ~85%	DeepSeek V4.1 Pro ~72%	~13 pts
Context window (max tokens)	Gemini 3.2 Pro 2M	Llama 4.5 Scout 10M	Open leads 5x
Input pricing (USD per 1M tokens)	GPT-5.6 ~$5	DeepSeek V4.1 Flash ~$0.12	Open ~42x cheaper
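
The Gap column is simple arithmetic on the two score columns, and it can be checked with a short script. The sketch below is illustrative only: the numbers are transcribed from the table, while the names `SCORES`, `AGENTIC`, and `gap_points` are assumptions introduced here and are not part of any published tooling.

```python
# Approximate scores transcribed from the table: (top closed, top open-weight), in percent.
SCORES = {
    "SWE-bench Verified": (78.0, 69.0),
    "HumanEval": (98.8, 97.8),
    "GPQA Diamond": (88.0, 75.0),
    "MMLU-Pro": (84.0, 74.0),
    "WebArena": (62.0, 48.0),
    "OSWorld": (52.0, 40.0),
    "TerminalBench": (85.0, 72.0),
}

# Benchmarks the page groups as agentic (hypothetical grouping constant).
AGENTIC = ("WebArena", "OSWorld", "TerminalBench")


def gap_points(closed: float, open_weight: float) -> float:
    """Percentage-point gap, rounded to one decimal to absorb float noise."""
    return round(closed - open_weight, 1)


gaps = {name: gap_points(c, o) for name, (c, o) in SCORES.items()}
agentic_gaps = [gaps[name] for name in AGENTIC]

# Arena scores are Elo ratings, so the gap is a rating difference, not points.
elo_gap = 1465 - 1410

# Ratios for the two rows where open-weight leads.
context_ratio = 10_000_000 / 2_000_000  # Llama 4.5 Scout vs Gemini 3.2 Pro
price_ratio = 5.00 / 0.12               # GPT-5.6 vs DeepSeek V4.1 Flash, USD per 1M input tokens

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name}: ~{gap:g} pts")
    print(f"Agentic gap range: {min(agentic_gaps):g}-{max(agentic_gaps):g} pts")
    print(f"Arena gap: ~{elo_gap} Elo")
    print(f"Context ratio: {context_ratio:g}x, price ratio: ~{price_ratio:.0f}x")
```

Because the underlying scores are approximate (each carries a "~"), the computed gaps inherit that uncertainty and should be read as rounded estimates.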

Key Takeaways

The HumanEval gap has effectively closed (~1 point) because both ends of the comparison sit in the saturation regime.
The SWE-bench Verified gap of ~9 points is the most consequential remaining frontier-coding differentiator.
Agentic benchmarks (WebArena, OSWorld, TerminalBench) show the most persistent closed-model advantage at ~12-14 points.
Open-weight wins on context window (Llama 4.5 Scout's is 5x larger than the top closed model's) and on pricing (~42x cheaper at the cheapest tier).
The trade-off pattern is clear: open-weight is near parity on saturated code-generation benchmarks such as HumanEval, while closed models lead by ~9-14 points on harder coding, reasoning, and agentic tasks.

What This Means for Brand Visibility

Brands that rely on agentic-task-driven discovery surfaces (browser agents, OS agents, terminal copilots) face higher closed-model exposure. Brands appearing primarily in chat-style retrieval queries face roughly equivalent visibility patterns across open-weight and closed models, with cost dynamics increasingly favoring open-weight deployment in developer tools and self-hosted enterprise contexts.

Methodology

Benchmark scores are drawn from vendor disclosures and public leaderboards as of June 2026. The top open-weight model is selected per benchmark; DeepSeek V4.1 Pro appears across multiple categories, reflecting its current open-weight leadership. This page is updated monthly.

How Presenc AI Helps

Presenc AI tracks brand visibility across both closed and open-weight models so brand teams see the full discovery surface where their brand competes.