MMLU-Pro is the harder successor to the original MMLU benchmark, with roughly 12,000 questions across 14 disciplines and ten answer choices per question instead of four. This page snapshots the public leaderboard as of June 2026.
June 2026 Leaderboard
| Rank | Model | Vendor | MMLU-Pro % |
|---|---|---|---|
| 1 | GPT-5.6 Pro | OpenAI | ~84% |
| 2 | Claude Mythos 5 | Anthropic | ~83% |
| 3 | Claude Opus 4.7 | Anthropic | ~82% |
| 4 | Gemini 3.2 Pro | Google | ~80% |
| 5 | GPT-5.6 | OpenAI | ~78% |
| 6 | Claude Sonnet 4.6 | Anthropic | ~76% |
| 7 | DeepSeek V4.1 Pro | DeepSeek | ~74% |
| 8 | Qwen 3.7 | Alibaba | ~72% |
| 9 | Gemini 3.2 Flash | Google | ~70% |
| 10 | GLM-6 | Zhipu AI | ~68% |
| 11 | Llama 4.5 Maverick | Meta | ~66% |
| 12 | Mistral Large 3 | Mistral AI | ~64% |
Key Takeaways
- The top three models cluster within 2 percentage points; differentiation has compressed since Q1 2026.
- The original MMLU is saturated above 90% for all major frontier models; MMLU-Pro is the meaningful successor.
- Open-weight DeepSeek V4.1 Pro at ~74% sits about 10 points behind the top closed frontier model (GPT-5.6 Pro at ~84%).
- The base-vs-Pro gap is approximately 6 percentage points for GPT-5.6 (~84% vs. ~78%); for Gemini 3.2, the Pro-vs-Flash gap is wider at about 10 points (~80% vs. ~70%).
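The gaps quoted in these takeaways can be re-derived from the table. The following is a minimal sketch, assuming the approximate scores are taken at face value as whole percentages; the names `SCORES`, `gap`, and `top_n_spread` are hypothetical helpers introduced for illustration and are not part of any leaderboard tooling.

```python
# Approximate MMLU-Pro scores (percent), transcribed from the June 2026 table.
# The source reports these as "~" values, so every result here is approximate too.
SCORES = {
    "GPT-5.6 Pro": 84,
    "Claude Mythos 5": 83,
    "Claude Opus 4.7": 82,
    "Gemini 3.2 Pro": 80,
    "GPT-5.6": 78,
    "Claude Sonnet 4.6": 76,
    "DeepSeek V4.1 Pro": 74,
    "Qwen 3.7": 72,
    "Gemini 3.2 Flash": 70,
    "GLM-6": 68,
    "Llama 4.5 Maverick": 66,
    "Mistral Large 3": 64,
}


def gap(higher: str, lower: str) -> int:
    """Return the score difference between two models, in percentage points."""
    return SCORES[higher] - SCORES[lower]


def top_n_spread(n: int) -> int:
    """Return the spread between the best and the n-th best score."""
    ranked = sorted(SCORES.values(), reverse=True)
    return ranked[0] - ranked[n - 1]


top3_spread = top_n_spread(3)                                # top three cluster
open_weight_gap = gap("GPT-5.6 Pro", "DeepSeek V4.1 Pro")    # open-weight vs. top closed model
gpt_tier_gap = gap("GPT-5.6 Pro", "GPT-5.6")                 # Pro vs. base tier
gemini_tier_gap = gap("Gemini 3.2 Pro", "Gemini 3.2 Flash")  # Pro vs. Flash tier

print(top3_spread, open_weight_gap, gpt_tier_gap, gemini_tier_gap)
```

Because the inputs are rounded, differences of a point or two are within the noise of the table itself. Using the table's Pro and Flash rows, the Gemini 3.2 tier gap comes out at about 10 points, compared with about 6 for GPT-5.6.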
Methodology
Scores are compiled from vendor disclosures and the MMLU-Pro public leaderboard. Numbers are approximate, because multiple-choice benchmarks are sensitive to prompt template choices. The page is updated monthly.
How Presenc AI Helps
Presenc AI tracks how shifts in general-capability rankings shape brand-visibility behavior across the broad set of consumer and enterprise queries where general capability matters.