SWE-bench Verified Leaderboard June 2026

Claude Mythos 5 and Claude Opus 4.7 lead the closed frontier, while DeepSeek V4.1 and Qwen 3.7 narrow the open-weight gap.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: June 2026

SWE-bench Verified is the most-cited real-world coding benchmark for frontier LLMs, measuring resolved-issue rate on a curated set of GitHub issues from popular Python repositories. This page snapshots the public leaderboard as of June 2026.
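As a minimal sketch of what "resolved-issue rate" means, the percentage is simply the number of issues a model's patches resolve divided by the number of issues attempted. The function name `resolved_rate` and the counts below are hypothetical and chosen for illustration; the 500-task total reflects how the Verified subset is commonly described and should be treated as an assumption here.

```python
def resolved_rate(resolved: int, total: int) -> float:
    """Return the percentage of benchmark issues that were resolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Hypothetical counts for illustration only, not a reported result.
print(f"{resolved_rate(390, 500):.1f}%")
```

Under these assumed counts, each resolved issue moves the score by 0.2 percentage points, which is one reason the rounded values on this page are best read as directional.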

June 2026 Leaderboard

Rank | Model | Vendor | SWE-bench Verified (% resolved)
1 | Claude Mythos 5 | Anthropic | ~78%
2 | Claude Opus 4.7 | Anthropic | ~75%
3 | GPT-5.6 Pro | OpenAI | ~73%
4 | GPT-5.6 | OpenAI | ~70%
5 | DeepSeek V4.1 Pro | DeepSeek | ~69%
6 | Claude Sonnet 4.6 | Anthropic | ~68%
7 | Qwen 3.7 | Alibaba | ~66%
8 | Gemini 3.2 Pro | Google | ~65%
9 | DeepSeek V4.1 Flash | DeepSeek | ~63%
10 | GLM-6 | Zhipu AI | ~58%
11 | Llama 4.5 Maverick | Meta | ~55%
12 | Mistral Large 3 | Mistral AI | ~52%

Key Takeaways

  • Claude Mythos 5 reached general availability in June 2026 and took the top spot from Claude Opus 4.7.
  • Open-weight DeepSeek V4.1 Pro (~69%) sits about 9 points behind the top closed model, Claude Mythos 5 (~78%), and within ~6 points of Claude Opus 4.7 (~75%).
  • Qwen 3.7 (~66%) is the second-highest-scoring Chinese model on this leaderboard, behind DeepSeek V4.1 Pro (~69%).
  • The gap between the top closed model and the top open-weight model has narrowed to single digits over the past 12 months.
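The closed versus open-weight gap can be checked directly against the table. The sketch below is illustrative: the scores are the approximate values from the June 2026 table, the open-weight flags cover only the models whose status this page states, and the helper names `top_score` and `closed_open_gap` are hypothetical.

```python
# Approximate scores copied from the June 2026 table on this page.
# Each row is (model, score in percent, is_open_weight). Only models
# whose closed or open-weight status the page states are included.
SCORES = [
    ("Claude Mythos 5", 78, False),
    ("Claude Opus 4.7", 75, False),
    ("GPT-5.6 Pro", 73, False),
    ("GPT-5.6", 70, False),
    ("DeepSeek V4.1 Pro", 69, True),
    ("Qwen 3.7", 66, True),
]


def top_score(rows, open_weight):
    """Return (model, score) for the best model in the chosen group."""
    candidates = [
        (score, name) for name, score, is_open in rows if is_open == open_weight
    ]
    score, name = max(candidates)
    return name, score


def closed_open_gap(rows):
    """Return the point gap between the best closed and best open-weight model."""
    _, closed = top_score(rows, open_weight=False)
    _, opened = top_score(rows, open_weight=True)
    return closed - opened


print(top_score(SCORES, open_weight=True))  # ('DeepSeek V4.1 Pro', 69)
print(closed_open_gap(SCORES))              # 9
```

Measured against the leader the gap is 9 points; measured against Claude Opus 4.7 it is 6 points. Because the inputs are rounded, either figure could shift by a point or so.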

Methodology

Scores are compiled from vendor disclosures, the public SWE-bench Verified leaderboard at swebench.com, and third-party replication where available. Numbers are expressed as ranges or rounded values; treat them as directional pending independent verification. The page is updated monthly.

How Presenc AI Helps

Presenc AI tracks how frontier coding capability shifts shape brand visibility inside developer tools and self-hosted enterprise deployments where these models get embedded.

Frequently Asked Questions

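What is SWE-bench Verified?
A coding benchmark that measures resolved-issue rate on a curated set of real GitHub issues from popular Python repositories. "Verified" means each issue was human-validated as solvable.

Which model leads SWE-bench Verified in June 2026?
Claude Mythos 5 from Anthropic at approximately 78%, ahead of Claude Opus 4.7 (~75%) and GPT-5.6 Pro (~73%).

How close are open-weight models to the closed frontier?
As of June 2026, the best open-weight model is roughly 9 percentage points behind the top closed model and about 6 points behind Claude Opus 4.7. DeepSeek V4.1 Pro at approximately 69% sits within a point of base GPT-5.6 (~70%) and trails only the top four closed models.

How often does the leaderboard change?
Material reorderings happen roughly every six to eight weeks as new frontier models ship. Smaller updates from fine-tunes and inference improvements happen weekly.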
