SWE-bench Verified is the most-cited real-world coding benchmark for frontier LLMs, measuring resolved-issue rate on a curated set of GitHub issues from popular Python repositories. This page snapshots the public leaderboard as of June 2026.
June 2026 Leaderboard
| Rank | Model | Vendor | SWE-bench Verified % |
|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | ~78% |
| 2 | Claude Opus 4.7 | Anthropic | ~75% |
| 3 | GPT-5.6 Pro | OpenAI | ~73% |
| 4 | GPT-5.6 | OpenAI | ~70% |
| 5 | DeepSeek V4.1 Pro | DeepSeek | ~69% |
| 6 | Claude Sonnet 4.6 | Anthropic | ~68% |
| 7 | Qwen 3.7 | Alibaba | ~66% |
| 8 | Gemini 3.2 Pro | Google | ~65% |
| 9 | DeepSeek V4.1 Flash | DeepSeek | ~63% |
| 10 | GLM-6 | Zhipu AI | ~58% |
| 11 | Llama 4.5 Maverick | Meta | ~55% |
| 12 | Mistral Large 3 | Mistral AI | ~52% |
Key Takeaways
- Claude Mythos 5 reached general availability (GA) in June 2026 and took the top spot from Claude Opus 4.7.
- Open-weight DeepSeek V4.1 Pro (~69%) trails the top closed model, Claude Mythos 5 (~78%), by about 9 points and Claude Opus 4.7 (~75%) by about 6.
- Among models from Chinese labs, DeepSeek V4.1 Pro (~69%) leads on this benchmark, followed by Qwen 3.7 (~66%).
- The gap between the top closed and top open-weight models has narrowed to single digits (about 9 points in this snapshot) over the past 12 months.
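
The point gaps behind these takeaways can be recomputed from the table. The following is a minimal sketch, assuming the rounded scores shown on this page; the `SCORES` dictionary and the `points_behind` helper are illustrative names, not part of any SWE-bench tooling.

```python
# Rounded SWE-bench Verified scores copied from the table above (percent).
# These are approximate values, so the gaps are approximate too.
SCORES = {
    "Claude Mythos 5": 78,
    "Claude Opus 4.7": 75,
    "GPT-5.6 Pro": 73,
    "GPT-5.6": 70,
    "DeepSeek V4.1 Pro": 69,
    "Claude Sonnet 4.6": 68,
    "Qwen 3.7": 66,
    "Gemini 3.2 Pro": 65,
    "DeepSeek V4.1 Flash": 63,
    "GLM-6": 58,
    "Llama 4.5 Maverick": 55,
    "Mistral Large 3": 52,
}


def points_behind(model, scores):
    """Return {higher_scoring_model: gap in percentage points} for `model`."""
    own = scores[model]
    return {name: s - own for name, s in scores.items() if s > own}


gaps = points_behind("DeepSeek V4.1 Pro", SCORES)
for name, gap in gaps.items():
    print(f"{name}: +{gap} points")
```

Because the inputs are rounded, a gap of one or two points here should not be read as a meaningful difference.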
Methodology
Scores are compiled from vendor disclosures, the public SWE-bench Verified leaderboard at swebench.com, and third-party replication where available. Numbers are expressed as ranges or rounded values; treat them as directional pending independent verification. The page is updated monthly.
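
To make "directional" concrete, the sketch below estimates how much sampling noise a single score carries. It assumes the 500-task size of SWE-bench Verified and treats each task as an independent pass/fail trial, which is a simplification; `resolved_rate` and `approx_95ci_halfwidth` are hypothetical helper names used only for illustration.

```python
import math

N_TASKS = 500  # number of tasks in SWE-bench Verified


def resolved_rate(resolved, total=N_TASKS):
    """Percentage of tasks whose issue was resolved."""
    return 100.0 * resolved / total


def approx_95ci_halfwidth(rate_pct, total=N_TASKS):
    """Half-width of a normal-approximation 95% interval, in percentage points.

    Assumption: tasks are independent Bernoulli trials with the same
    success probability, which real benchmark tasks only roughly satisfy.
    """
    p = rate_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / total)
    return 100.0 * 1.96 * standard_error


rate = resolved_rate(350)  # 350 of 500 tasks resolved
print(f"rate = {rate:.1f}%, +/- {approx_95ci_halfwidth(rate):.1f} points")
```

Under these assumptions a score near 70% carries roughly four points of uncertainty in either direction, so adjacent ranks in the table may not be statistically distinguishable.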
How Presenc AI Helps
Presenc AI tracks how frontier coding capability shifts shape brand visibility inside developer tools and self-hosted enterprise deployments where these models get embedded.