HumanEval is a 164-problem Python function-completion benchmark from OpenAI that has anchored coding-model evaluation since 2021. By June 2026, most frontier models exceed 95% pass@1, which makes the benchmark less discriminating than SWE-bench Verified or LiveCodeBench, but the leaderboard remains widely cited.
June 2026 Leaderboard (pass@1)
| Rank | Model | Vendor | HumanEval % |
|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | ~98.8% |
| 2 | GPT-5.6 Pro | OpenAI | ~98.5% |
| 3 | Claude Opus 4.7 | Anthropic | ~98.2% |
| 4 | DeepSeek V4.1 Pro | DeepSeek | ~97.8% |
| 5 | Qwen 3.7 | Alibaba | ~97.5% |
| 6 | GPT-5.6 | OpenAI | ~97.3% |
| 7 | Gemini 3.2 Pro | Google | ~97.1% |
| 8 | Claude Sonnet 4.6 | Anthropic | ~96.8% |
| 9 | GLM-6 | Zhipu AI | ~96.0% |
| 10 | Llama 4.5 Maverick | Meta | ~95.5% |
| 11 | Mistral Large 3 | Mistral AI | ~94.8% |
| 12 | Hunyuan Large 3 | Tencent | ~94.5% |
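For readers unfamiliar with the metric, pass@1 is the probability that a single generated solution passes all of a problem's unit tests, averaged over the benchmark. The sketch below shows the unbiased pass@k estimator described in the paper that introduced HumanEval, assuming n samples were drawn per problem and c of them passed. The function names are illustrative, and vendors may instead report a single greedy sample per problem, so this is not necessarily how any score in the table was produced.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a model that solves a problem in 3 of 10 samples contributes 0.3 to the average.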
Why HumanEval Is Saturated
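Twelve frontier models now sit within roughly 4 percentage points on HumanEval (about 98.8% down to 94.5%). With only 164 problems, each problem is worth about 0.6 points, so most adjacent ranks differ by less than a single problem, and score differences inside the saturation band are largely noise. Practitioners increasingly cite SWE-bench Verified, LiveCodeBench, and BigCodeBench for meaningful coding-capability differentiation.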
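The noise argument can be checked with back-of-the-envelope arithmetic. The sketch below converts a percentage score into a problem count and estimates a binomial standard error. It assumes each of the 164 problems is an independent pass/fail trial, which is a simplification, and the helper names are illustrative.

```python
from math import sqrt

N_PROBLEMS = 164  # number of problems in HumanEval


def problems_solved(score_pct: float, n: int = N_PROBLEMS) -> int:
    """Approximate number of problems solved for a given pass@1 percentage."""
    return round(score_pct / 100 * n)


def standard_error_pct(score_pct: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of the score, in percentage points.

    Assumption: problems are independent Bernoulli trials with equal weight.
    """
    p = score_pct / 100
    return 100 * sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    top, bottom = 98.8, 94.5  # rank 1 and rank 12 in the table above
    print("points per problem:", round(100 / N_PROBLEMS, 2))
    print("problem gap:", problems_solved(top) - problems_solved(bottom))
    print("standard error at top:", round(standard_error_pct(top), 2))
    print("standard error at bottom:", round(standard_error_pct(bottom), 2))
```

Under these assumptions the standard error is roughly 0.9 points at the top of the table and 1.8 points at the bottom, which is larger than the gap between most neighboring ranks.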
Methodology
Scores are compiled from vendor disclosures and the HumanEval public leaderboard. Numbers are approximate; in the saturation regime, rank order should not be over-interpreted. Updated monthly.
How Presenc AI Helps
Presenc AI tracks how brand visibility shifts as developer tooling integrates new frontier coding models. HumanEval scores no longer predict deployment share; agentic-coding benchmarks like SWE-bench Verified do.