MMLU-Pro Leaderboard June 2026

The MMLU-Pro general-capability leaderboard for June 2026. The expanded benchmark, with roughly 12,000 questions, continues to separate frontier from mid-tier models more clearly than the original MMLU.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: June 2026

MMLU-Pro is the harder successor to the original MMLU benchmark, expanding to roughly 12,000 questions across 14 disciplines with ten answer choices per question instead of four. This page snapshots the public leaderboard as of June 2026.
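One consequence of moving from four answer choices to ten is a lower random-guessing floor. The short sketch below works through that arithmetic; the function name `chance_accuracy` is illustrative and not part of any benchmark tooling.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing on one question."""
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    return 1.0 / num_choices


# Original MMLU uses four options per question; MMLU-Pro uses ten.
mmlu_floor = chance_accuracy(4)
mmlu_pro_floor = chance_accuracy(10)

print(f"MMLU guessing floor:     {mmlu_floor:.0%}")
print(f"MMLU-Pro guessing floor: {mmlu_pro_floor:.0%}")
```

With ten options, a model that guesses blindly is expected to land near 10% rather than 25%, so less of any reported score can be attributed to luck.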

June 2026 Leaderboard

Rank | Model              | Vendor     | MMLU-Pro %
1    | GPT-5.6 Pro        | OpenAI     | ~84%
2    | Claude Mythos 5    | Anthropic  | ~83%
3    | Claude Opus 4.7    | Anthropic  | ~82%
4    | Gemini 3.2 Pro     | Google     | ~80%
5    | GPT-5.6            | OpenAI     | ~78%
6    | Claude Sonnet 4.6  | Anthropic  | ~76%
7    | DeepSeek V4.1 Pro  | DeepSeek   | ~74%
8    | Qwen 3.7           | Alibaba    | ~72%
9    | Gemini 3.2 Flash   | Google     | ~70%
10   | GLM-6              | Zhipu AI   | ~68%
11   | Llama 4.5 Maverick | Meta       | ~66%
12   | Mistral Large 3    | Mistral AI | ~64%

Key Takeaways

  • The top three models sit within 2 percentage points of each other (~84%, ~83%, ~82%); the spread at the top has narrowed since Q1 2026.
  • All major frontier models now score above 90% on the original MMLU, so that benchmark is saturated; MMLU-Pro is the more discriminating successor.
  • Open-weight DeepSeek V4.1 Pro, at ~74%, sits about 10 points behind the top closed frontier model (~84%).
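  • The base-vs-Pro gap is approximately 6 percentage points for GPT-5.6 (~78% vs ~84%) and similar for Gemini 3.2; note that the table lists only Gemini 3.2 Pro and Flash, which are about 10 points apart.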
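The gaps quoted in these takeaways can be re-derived from the leaderboard table. The following is a minimal sketch that hard-codes the approximate scores from the table; the dictionary and variable names are illustrative only.

```python
# Approximate MMLU-Pro scores (percent), copied from the June 2026 table.
SCORES = {
    "GPT-5.6 Pro": 84,
    "Claude Mythos 5": 83,
    "Claude Opus 4.7": 82,
    "Gemini 3.2 Pro": 80,
    "GPT-5.6": 78,
    "Claude Sonnet 4.6": 76,
    "DeepSeek V4.1 Pro": 74,
    "Qwen 3.7": 72,
    "Gemini 3.2 Flash": 70,
    "GLM-6": 68,
    "Llama 4.5 Maverick": 66,
    "Mistral Large 3": 64,
}


def gap(higher: str, lower: str) -> int:
    """Score difference in percentage points between two listed models."""
    return SCORES[higher] - SCORES[lower]


# Models ordered from highest to lowest score.
ranked = sorted(SCORES, key=SCORES.get, reverse=True)

top_three_spread = gap(ranked[0], ranked[2])           # rank 1 vs rank 3
open_weight_gap = gap(ranked[0], "DeepSeek V4.1 Pro")  # leader vs DeepSeek
gpt_tier_gap = gap("GPT-5.6 Pro", "GPT-5.6")           # Pro vs base GPT-5.6
gemini_pro_flash_gap = gap("Gemini 3.2 Pro", "Gemini 3.2 Flash")

print(top_three_spread, open_weight_gap, gpt_tier_gap, gemini_pro_flash_gap)
```

Because the table values are rounded approximations, these differences are themselves approximate and should be read as rough gaps rather than precise margins.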

Methodology

Scores are compiled from vendor disclosures and the MMLU-Pro public leaderboard. All numbers are approximate; multiple-choice benchmarks are sensitive to prompt-template choices, so the same model can score differently under different evaluation setups. The page is updated monthly.
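To make the prompt-template point concrete, the sketch below renders one ten-option question under two different templates. Everything in it is hypothetical: the toy question is not taken from MMLU-Pro, and neither template is claimed to be the one used by the public leaderboard or by any vendor.

```python
import string

# Toy question for illustration only; NOT an MMLU-Pro item.
QUESTION = "Which of these numbers is prime?"
OPTIONS = ["21", "22", "23", "24", "25", "26", "27", "28", "30", "32"]


def render_plain(question: str, options: list[str]) -> str:
    """Hypothetical template A: bare letters and a trailing 'Answer:' cue."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def render_instructed(question: str, options: list[str]) -> str:
    """Hypothetical template B: labeled sections plus a reasoning instruction."""
    letters = string.ascii_uppercase[: len(options)]
    body = "\n".join(f"({letter}) {text}" for letter, text in zip(letters, options))
    return (
        f"Question: {question}\n"
        f"Options:\n{body}\n"
        "Think step by step, then finish with 'The answer is (X)'."
    )


plain_prompt = render_plain(QUESTION, OPTIONS)
instructed_prompt = render_instructed(QUESTION, OPTIONS)

print(plain_prompt)
print("-" * 40)
print(instructed_prompt)
```

The two prompts carry the same question and options but differ in labeling, answer cue, and whether reasoning is requested, which is the kind of variation that can shift a reported multiple-choice score.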

How Presenc AI Helps

Presenc AI tracks how shifts in general-capability rankings shape brand-visibility behavior across the broad set of consumer and enterprise queries where general capability matters.

Frequently Asked Questions

What is MMLU-Pro?
MMLU-Pro is the harder successor to MMLU, expanding to approximately 12,000 questions across 14 disciplines with ten answer choices per question instead of four. It is designed to remain discriminating as frontier models saturate the original MMLU.

Which model leads MMLU-Pro in June 2026?
GPT-5.6 Pro from OpenAI leads at approximately 84%, narrowly ahead of Claude Mythos 5 at approximately 83% and Claude Opus 4.7 at approximately 82%.

Is the original MMLU saturated?
All major frontier models exceed approximately 90% on the original MMLU as of mid-2026, so score differences inside that saturation band are largely noise rather than meaningful capability differences.

Which open-weight model scores highest?
DeepSeek V4.1 Pro leads the open-weight category at approximately 74% and sits about 10 points behind the top closed frontier model.