HumanEval Leaderboard June 2026

HumanEval pass@1 leaderboard for June 2026. Most frontier models now exceed 95%; the meaningful differentiation has shifted to SWE-bench Verified and LiveCodeBench.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: June 2026

HumanEval is a 164-problem Python function-completion benchmark from OpenAI that has anchored coding-model evaluation since 2021. By June 2026 most frontier models exceed 95% pass@1, making it less discriminating than SWE-bench Verified or LiveCodeBench, but the leaderboard remains widely cited.
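To make the pass@1 figure concrete, here is a minimal sketch of the unbiased pass@k estimator published alongside HumanEval, which for k = 1 reduces to the fraction of correct samples per problem. The function names and the sample run are illustrative assumptions, not part of any vendor's evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem pass@k over (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per problem, 162 of the 164 problems solved.
hypothetical_results = [(1, 1)] * 162 + [(1, 0)] * 2
score = benchmark_pass_at_k(hypothetical_results, k=1)
print(f"pass@1 = {score:.1%}")  # prints "pass@1 = 98.8%"
```

With a single sample per problem, the reportable scores move in steps of one problem, so two missed problems out of 164 already account for the gap between a perfect score and the top of the table.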

June 2026 Leaderboard (pass@1)

Rank  Model                 Vendor      HumanEval %
1     Claude Mythos 5       Anthropic   ~98.8%
2     GPT-5.6 Pro           OpenAI      ~98.5%
3     Claude Opus 4.7       Anthropic   ~98.2%
4     DeepSeek V4.1 Pro     DeepSeek    ~97.8%
5     Qwen 3.7              Alibaba     ~97.5%
6     GPT-5.6               OpenAI      ~97.3%
7     Gemini 3.2 Pro        Google      ~97.1%
8     Claude Sonnet 4.6     Anthropic   ~96.8%
9     GLM-6                 Zhipu AI    ~96.0%
10    Llama 4.5 Maverick    Meta        ~95.5%
11    Mistral Large 3       Mistral AI  ~94.8%
12    Hunyuan Large 3       Tencent     ~94.5%

Why HumanEval Is Saturated

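The twelve frontier models listed sit within roughly 4 percentage points of each other on HumanEval. With only 164 problems, a single problem is worth about 0.6 percentage points, so most adjacent ranks are separated by less than one solved problem, and score differences inside the saturation band are largely noise. Practitioners increasingly cite SWE-bench Verified, LiveCodeBench, and BigCodeBench for meaningful coding-capability differentiation.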
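A rough calculation shows why gaps of this size are hard to interpret. The sketch below assumes each of the 164 problems is an independent pass/fail trial (a simplifying binomial assumption, not something the leaderboard states) and computes how much one problem is worth and the approximate sampling error around a score in the saturation band.

```python
from math import sqrt

NUM_PROBLEMS = 164  # size of the HumanEval problem set


def points_per_problem(n: int = NUM_PROBLEMS) -> float:
    """Percentage points gained or lost by solving one more or one fewer problem."""
    return 100.0 / n


def binomial_std_error(score_pct: float, n: int = NUM_PROBLEMS) -> float:
    """Standard error, in percentage points, of a pass rate under a binomial model."""
    p = score_pct / 100.0
    return 100.0 * sqrt(p * (1.0 - p) / n)


step = points_per_problem()
se = binomial_std_error(97.0)  # 97% is an illustrative mid-band score
print(f"one problem     = {step:.2f} points")
print(f"std error @ 97% = {se:.2f} points")
print(f"approx 95% band = +/- {1.96 * se:.2f} points")
```

Under these assumptions one problem is worth about 0.6 points and the standard error near 97% is a little over one point, which is larger than most of the gaps between neighbouring ranks.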

Methodology

Scores are compiled from vendor disclosures and the HumanEval public leaderboard. Numbers are given as approximate values (marked with ~); because the benchmark is saturated, rank order should not be over-interpreted. The table is updated monthly.
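One way to avoid over-reading rank order is to group models into tiers instead of comparing individual positions. The sketch below is purely illustrative and is not the method used to build this leaderboard: it takes the approximate scores from the table above and starts a new tier whenever a model falls more than an assumed margin below the tier's leader. The 2.6-point margin is a hypothetical threshold, roughly a 95% binomial band near 97% on 164 problems.

```python
# Approximate scores copied from the June 2026 table, already sorted high to low.
SCORES = [
    ("Claude Mythos 5", 98.8), ("GPT-5.6 Pro", 98.5), ("Claude Opus 4.7", 98.2),
    ("DeepSeek V4.1 Pro", 97.8), ("Qwen 3.7", 97.5), ("GPT-5.6", 97.3),
    ("Gemini 3.2 Pro", 97.1), ("Claude Sonnet 4.6", 96.8), ("GLM-6", 96.0),
    ("Llama 4.5 Maverick", 95.5), ("Mistral Large 3", 94.8), ("Hunyuan Large 3", 94.5),
]

ASSUMED_MARGIN = 2.6  # hypothetical noise margin in percentage points


def group_into_tiers(scores, margin):
    """Group a descending list so each tier spans at most `margin` points from its leader."""
    tiers = []
    for name, score in scores:
        if tiers and tiers[-1][0][1] - score <= margin:
            tiers[-1].append((name, score))  # still within margin of the tier leader
        else:
            tiers.append([(name, score)])  # start a new tier led by this model
    return tiers


tiers = group_into_tiers(SCORES, ASSUMED_MARGIN)
for index, tier in enumerate(tiers, start=1):
    names = ", ".join(name for name, _ in tier)
    print(f"Tier {index}: {names}")
```

The tier boundaries depend entirely on the chosen margin, so they should be read as a reminder of how coarse the comparison is and not as a ranking in their own right.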

How Presenc AI Helps

Presenc AI tracks how brand visibility shifts as developer tooling integrates new frontier coding models. HumanEval scores no longer predict deployment share; agentic-coding benchmarks like SWE-bench Verified do.

Frequently Asked Questions

HumanEval is a 164-problem Python function-completion benchmark released by OpenAI in 2021. Its pass@1 metric measures the share of problems solved on the first attempt.
HumanEval is considered saturated because ten of the twelve models listed exceed 95% pass@1 and all twelve exceed 94%. Score differences inside the saturation band are largely noise rather than meaningful capability differences.
More discriminating alternatives include SWE-bench Verified (real GitHub issues), LiveCodeBench (contamination-resistant), and BigCodeBench (multi-step), along with agentic benchmarks such as WebArena and OSWorld for coding-related tasks.
HumanEval is still worth reporting for completeness, but rank order should not drive procurement decisions in 2026. SWE-bench Verified and LiveCodeBench scores discriminate frontier models more meaningfully.
