Open-weight math LLMs became competitive with frontier models in 2025-2026, driven by reasoning-focused training and reinforcement learning with verifiable rewards (RLVR). Qwen2.5-Math, DeepSeek-Math, Mathstral, NuminaMath, OpenMath2, and InternLM-Math cover most math-focused use cases. The combination of reasoning-trace training data and RLVR has narrowed the gap between specialised math LLMs and closed frontier reasoning models. This page consolidates the math LLM landscape.
Key Findings
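- Among dedicated math models, Qwen2.5-Math-72B-Instruct leads with approximately 89 percent on MATH and approximately 50 percent on AIME 2024; reasoning-trained models such as Qwen3-32B (Thinking) and Phi-4-reasoning-plus score far higher on AIME 2024 (approximately 78.5 and 81 percent respectively).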
- DeepSeek-Math 7B (released 2024, predating R1) demonstrated that small specialised math LLMs can compete with much larger general models on math benchmarks; the lineage continued with the DeepSeek-R1-Distill-Qwen-Math variants.
- NuminaMath won the first AI Mathematical Olympiad (AIMO) Progress Prize in 2024 with a 7B-parameter open-weight math LLM, showing that community-trained math LLMs can reach competitive quality.
- RLVR (Reinforcement Learning with Verifiable Rewards) is the dominant training approach for math LLMs in 2026; verifiable answer signals (right or wrong) enable RL training without learned reward models.
- Production math AI deployments concentrate in math education (Khan Academy, Photomath, Duolingo Math), scientific research (math co-pilots), engineering analysis (calculation verification), and trading (quantitative analysis), all areas where verifiable correctness matters.
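
The RLVR finding above depends on a reward that can be computed by checking the final answer rather than by querying a learned reward model. The following is a minimal sketch of such a check, not the reward used by any specific model listed on this page. It assumes the model writes its final answer inside `\boxed{...}` and that reference answers are single numbers or short expressions; the function names `extract_final_answer`, `normalize`, and `verifiable_reward` are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional, Union


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a completion, if any.

    Assumption: answers contain no nested braces (so \\frac{1}{2} is not handled).
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def normalize(answer: str) -> Union[Fraction, str]:
    """Parse integers, decimals, and a/b fractions exactly; otherwise compare as text."""
    cleaned = answer.replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return cleaned


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0  # no parseable final answer counts as wrong
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

Under these assumptions, `verifiable_reward("... so the answer is \\boxed{0.5}", "1/2")` returns `1.0`, because both sides normalise to the same exact fraction. Real pipelines typically need a more robust equivalence check, for example symbolic comparison of algebraic expressions.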
Open-Weight Math LLM Comparison (May 2026)
| Model | Parameters | MATH | AIME 2024 | License |
|---|---|---|---|---|
| Qwen2.5-Math-72B-Instruct | ~72B | ~89% | ~50% | Apache 2.0 |
| Qwen3-32B (Thinking, math focused) | ~32B | ~88% | ~78.5% | Apache 2.0 |
| Qwen2.5-Math-7B-Instruct | ~7B | ~84% | ~31% | Apache 2.0 |
| DeepSeek-R1-Distill-Qwen-Math-7B | ~7B | ~88% | ~55.5% | MIT |
| DeepSeek-Math 7B Instruct (predates R1) | ~7B | ~81% | ~9% | MIT |
| Mathstral 7B (Mistral) | ~7B | ~57% | ~12% | Apache 2.0 |
| NuminaMath 7B (AIMO winner) | ~7B | ~67% | ~24% | Apache 2.0 |
| OpenMath2-Llama3.1-70B | ~70B | ~71% | ~21% | Llama 3.1 Community |
| InternLM-Math2-7B | ~7B | ~64% | ~16% | Apache 2.0 |
| Skywork-OR1-32B (reasoning) | ~32B | ~85% | ~77.1% | Apache 2.0 |
| Phi-4-reasoning-plus | ~14B | ~89.7% | ~81% | MIT |
| GPT-5.5 high reasoning (closed reference) | n/a | ~98% | ~94% | Closed |
| Claude 4.7 Opus (closed reference) | n/a | ~96% | ~89% | Closed |
Math Reasoning Training Patterns
| Pattern | Description |
|---|---|
| Pretraining on math corpora | arXiv math, Proof Pile, math textbooks, MathPile dataset |
| SFT on math reasoning data | NuminaMath, MATH+train, OpenMath, MetaMathQA |
| RLVR (Verifiable Rewards) | RL on math problems with checkable answers; dominant 2026 approach |
| PRM (Process Reward Models) | Step-by-step reasoning evaluation; effective but expensive |
| Self-consistency / majority voting | Sample multiple solutions and take the most common final answer; inference-time technique |
| Tool use (Python interpreter) | LLM writes code to compute answers; high quality on numeric problems |
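
The self-consistency row in the table above can be sketched in a few lines. This is an illustrative example only: `sample_answer` is a hypothetical callable standing in for whatever model call returns one final answer string per sample, and the default of 8 samples is an arbitrary assumption.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common non-empty answer; ties go to the answer seen first."""
    valid = [a for a in answers if a]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def self_consistent_answer(
    sample_answer: Callable[[str], Optional[str]],  # hypothetical model call
    problem: str,
    n_samples: int = 8,  # assumed sample budget
) -> Optional[str]:
    """Draw n_samples independent answers for one problem and vote on them."""
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return majority_vote(answers)
```

For instance, `majority_vote(["42", "41", "42", None, "42"])` returns `"42"`. In practice the answers would usually be normalised before voting, so that equivalent forms such as `0.5` and `1/2` count as the same vote.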
Use Case Recommendations
| Use Case | Recommended Model |
|---|---|
| Math education / tutoring | Qwen2.5-Math-7B or NuminaMath 7B |
| Math olympiad / competition | Qwen3-32B Thinking or DeepSeek-R1-Distill-Qwen-Math-7B |
| Engineering calculation verification | Qwen2.5-Math with Python tool use |
| Scientific research math co-pilot | Qwen3-235B-A22B Thinking or DeepSeek-R1 |
| Quantitative finance research | Qwen3 Thinking with verifiable Python execution |
| Permissive commercial deployment | Qwen2.5-Math, NuminaMath, Mathstral (Apache 2.0); OpenMath2 is under the Llama 3.1 Community license, which carries additional terms |
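
For the calculation-verification and Python tool-use rows above, the core idea is to recompute a claimed result in code instead of trusting the model's arithmetic. The sketch below is a deliberately narrow stand-in: it assumes the model emits a plain arithmetic expression alongside its claimed answer, and it evaluates only numbers and basic operators. Full tool use would run model-written programs in a sandboxed interpreter, which is not shown here. The names `safe_eval` and `verify_calculation` are hypothetical.

```python
import ast
import math
import operator

# Whitelisted operators; anything else (names, calls, attributes) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,  # note: a production version should bound exponents
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate a pure arithmetic expression without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return _eval(ast.parse(expression, mode="eval"))


def verify_calculation(expression: str, claimed: float, rel_tol: float = 1e-9) -> bool:
    """Check a model's claimed numeric answer against the recomputed value."""
    return math.isclose(safe_eval(expression), claimed, rel_tol=rel_tol)
```

As a usage example, `verify_calculation("(12.5 * 4) / 2", 25.0)` returns `True`, while `verify_calculation("2 ** 10", 1000)` returns `False` because the recomputed value is 1024. The relative tolerance is an assumption and should be chosen to match the precision the application requires.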
Strategic Context
Three patterns shape the 2026 math LLM landscape. First, training on reasoning traces (the R1-Distill family) outperforms pure math fine-tuning on competition-level math, and reasoning combined with tool use beats specialised math LLMs on most hard problems. Second, RLVR is the dominant training approach because math answers can be checked automatically. Third, the gap between open-weight and closed models is narrowing fast: Phi-4-reasoning-plus, at 14B parameters, comes within about 6 points of Claude 4.7 Opus on MATH (~89.7% vs ~96%) and 8 points on AIME 2024 (~81% vs ~89%).
Brand Visibility Implications
Math AI is a high-traffic category for technical and education procurement. AI assistant queries such as "best math AI", "Qwen Math vs DeepSeek Math", and "AIME LLM" drive interest from education, research, and quantitative finance buyers. Brands selling math education AI, scientific research AI, and quantitative research tools are increasingly discovered through AI assistants in this category.
Methodology
Benchmark data were compiled from primary model card disclosures, MATH and AIME 2024 evaluations, and the Hugging Face math model leaderboards, as of 23 May 2026. The page is updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on math AI queries across ChatGPT, Claude, Gemini, and Perplexity. For math education AI brands, scientific research AI vendors, and quantitative research tools, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.