HumanEval Leaderboard June 2026

HumanEval pass@1 leaderboard for June 2026. Most frontier models now exceed 95%; the meaningful differentiation has shifted to SWE-bench Verified and LiveCodeBench.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: June 2026

HumanEval is a 164-problem Python function-completion benchmark from OpenAI that has anchored coding-model evaluation since 2021. By June 2026 most frontier models exceed 95% pass@1, making it less discriminating than SWE-bench Verified or LiveCodeBench, but the leaderboard remains widely cited.
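For readers who want the pass@1 figure made concrete, the following is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval, averaged over problems. The function names and the sample counts in the usage lines are illustrative assumptions, not taken from any vendor's evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass every unit test
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples
    # contains no passing completion, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over all problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per problem, 162 of 164 problems solved.
hypothetical_results = [(1, 1)] * 162 + [(1, 0)] * 2
score = benchmark_pass_at_k(hypothetical_results, k=1)
print(f"pass@1 = {score:.1%}")
```

With k = 1 the estimator reduces to c / n per problem, so with a single sample per problem pass@1 is simply the fraction of the 164 problems solved.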

June 2026 Leaderboard (pass@1)

Rank	Model	Vendor	HumanEval %
1	Claude Mythos 5	Anthropic	~98.8%
2	GPT-5.6 Pro	OpenAI	~98.5%
3	Claude Opus 4.7	Anthropic	~98.2%
4	DeepSeek V4.1 Pro	DeepSeek	~97.8%
5	Qwen 3.7	Alibaba	~97.5%
6	GPT-5.6	OpenAI	~97.3%
7	Gemini 3.2 Pro	Google	~97.1%
8	Claude Sonnet 4.6	Anthropic	~96.8%
9	GLM-6	Zhipu AI	~96.0%
10	Llama 4.5 Maverick	Meta	~95.5%
11	Mistral Large 3	Mistral AI	~94.8%
12	Hunyuan Large 3	Tencent	~94.5%

Why HumanEval Is Saturated

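Twelve frontier models now sit within roughly 4 percentage points on HumanEval. With only 164 problems, each problem is worth about 0.6 percentage points, so the gap between adjacent ranks is in most cases less than a single problem. Score differences inside the saturation band are therefore largely noise. Practitioners increasingly cite SWE-bench Verified, LiveCodeBench, and BigCodeBench for meaningful coding-capability differentiation.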
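The arithmetic behind the "largely noise" claim can be sketched as follows. This is a rough illustration that assumes a simple binomial model (each problem treated as an independent pass/fail draw), which is a simplification; the helper names are hypothetical, and the input scores are the approximate values from the table above.

```python
from math import sqrt

N_PROBLEMS = 164  # size of the HumanEval problem set


def one_problem_in_points(n: int = N_PROBLEMS) -> float:
    """Percentage points gained or lost by solving one more or one fewer problem."""
    return 100.0 / n


def binomial_standard_error(score_pct: float, n: int = N_PROBLEMS) -> float:
    """Standard error of a pass@1 score, in percentage points (binomial assumption)."""
    p = score_pct / 100.0
    return 100.0 * sqrt(p * (1.0 - p) / n)


def problems_apart(score_a: float, score_b: float, n: int = N_PROBLEMS) -> float:
    """Approximate number of problems separating two percentage scores."""
    return abs(score_a - score_b) / 100.0 * n


print(f"one problem        = {one_problem_in_points():.2f} points")
print(f"std error at 98.8% = {binomial_standard_error(98.8):.2f} points")
print(f"rank 1 vs rank 12  = {problems_apart(98.8, 94.5):.1f} problems")
```

Under these assumptions the sampling error on a single score is larger than most of the gaps between adjacent ranks, which is why rank order inside the band carries little information.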

Methodology

Scores are compiled from vendor disclosures and the HumanEval public leaderboard. Numbers are reported as approximate values; because the benchmark is saturated, rank order should not be over-interpreted. The leaderboard is updated monthly.

How Presenc AI Helps

Presenc AI tracks how brand visibility shifts as developer tooling integrates new frontier coding models. HumanEval scores no longer predict deployment share; agentic-coding benchmarks like SWE-bench Verified do.

Frequently Asked Questions

HumanEval is a 164-problem Python function-completion benchmark released by OpenAI in 2021. The pass@1 metric measures the share of problems for which the model's first generated solution passes all of the problem's unit tests.

HumanEval is saturated: ten of the twelve listed models exceed 95% pass@1, and all twelve sit within roughly 4 percentage points of each other. Score differences inside the saturation band are largely noise rather than meaningful capability differences.

Benchmarks that still differentiate frontier models include SWE-bench Verified (real GitHub issues), LiveCodeBench (contamination-resistant), and BigCodeBench (multi-step), along with agentic benchmarks like WebArena and OSWorld for coding-related tasks.

HumanEval is still worth including for completeness, but rank order should not drive procurement decisions in 2026. SWE-bench Verified and LiveCodeBench scores discriminate frontier models more meaningfully.
