Open-Weight Math LLMs 2026

Open-weight math LLMs in 2026: Qwen2.5-Math, DeepSeek-Math, Mathstral, NuminaMath, OpenMath2, InternLM-Math. AIME, MATH, GSM8K benchmarks, RLVR training patterns.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Open-weight math LLMs became competitive with frontier models in 2025-2026, driven by reasoning-trace training and RLVR (Reinforcement Learning with Verifiable Rewards). Qwen2.5-Math, DeepSeek-Math, Mathstral, NuminaMath, OpenMath2, and InternLM-Math cover most math-focused use cases. The combination of reasoning-trace training data and RLVR has narrowed the gap between specialised math LLMs and frontier closed reasoning models. This page consolidates the math LLM landscape.

Key Findings

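  1. Qwen2.5-Math-72B and Qwen3-Math variants lead among math-specialised open-weight models, with approximately 89 percent on MATH and approximately 50 percent on AIME 2024; open-weight reasoning models such as Qwen3-32B (Thinking) and Phi-4-reasoning-plus score considerably higher on AIME 2024 (approximately 78.5 and 81 percent).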
  2. DeepSeek-Math 7B (released 2024, predating R1) demonstrated that small specialised math LLMs can compete with much larger general models on math benchmarks; lineage continued via DeepSeek-R1-Distill-Qwen-Math variants.
  3. NuminaMath won the first AI Mathematical Olympiad (AIMO) Progress Prize in 2024 with a 7B-parameter open-weight math LLM, showing that community-trained math LLMs can reach competitive quality.
  4. RLVR (Reinforcement Learning with Verifiable Rewards) is the dominant training approach for math LLMs in 2026; verifiable answer signals (right or wrong) enable RL training without learned reward models.
  5. Production math AI deployments concentrate in math education (Khan Academy, Photomath, Duolingo Math), scientific research (math co-pilots), engineering analysis (calculation verification), and trading (quantitative analysis) where verifiable correctness matters.
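
The rule-based reward described in finding 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular lab's implementation: the function names (`extract_boxed`, `normalize`, `verifiable_reward`) are hypothetical, and the sketch assumes the model writes its final answer inside `\boxed{...}`, which is a common but not universal convention. Real verifiers often add symbolic equivalence checks on top of this.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    """Canonicalise simple numeric answers so that 0.5 and 1/2 compare equal."""
    cleaned = answer.strip().replace(" ", "")
    try:
        return str(Fraction(cleaned))
    except (ValueError, ZeroDivisionError):
        return cleaned  # non-numeric answers fall back to string comparison


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

Because the reward is computed by a rule rather than a learned model, it cannot be gamed by outputs that merely look convincing; the trade-off is that it only works where a reference answer exists.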

Open-Weight Math LLM Comparison (May 2026)

| Model | Parameters | MATH | AIME 2024 | License |
|---|---|---|---|---|
| Qwen2.5-Math-72B-Instruct | ~72B | ~89% | ~50% | Apache 2.0 |
| Qwen3-32B (Thinking, math focused) | ~32B | ~88% | ~78.5% | Apache 2.0 |
| Qwen2.5-Math-7B-Instruct | ~7B | ~84% | ~31% | Apache 2.0 |
| DeepSeek-R1-Distill-Qwen-Math-7B | ~7B | ~88% | ~55.5% | MIT |
| DeepSeek-Math 7B Instruct | ~7B | ~81% | ~9% | MIT (predates R1) |
| Mathstral 7B (Mistral) | ~7B | ~57% | ~12% | Apache 2.0 |
| NuminaMath 7B (AIMO winner) | ~7B | ~67% | ~24% | Apache 2.0 |
| OpenMath2-Llama3.1-70B | ~70B | ~71% | ~21% | Llama 3.1 Community |
| InternLM-Math2-7B | ~7B | ~64% | ~16% | Apache 2.0 |
| Skywork-OR1-32B (reasoning) | ~32B | ~85% | ~77.1% | Apache 2.0 |
| Phi-4-reasoning-plus | ~14B | ~89.7% | ~81% | MIT |
| GPT-5.5 high reasoning (closed reference) | n/a | ~98% | ~94% | Closed |
| Claude 4.7 Opus (closed reference) | n/a | ~96% | ~89% | Closed |

Math Reasoning Training Patterns

| Pattern | Description |
|---|---|
| Pretraining on math corpora | arXiv math, Proof Pile, math textbooks, MathPile dataset |
| SFT on math reasoning data | NuminaMath, MATH+train, OpenMath, MetaMathQA |
| RLVR (Verifiable Rewards) | RL on math problems with checkable answers; dominant 2026 approach |
| PRM (Process Reward Models) | Step-by-step reasoning evaluation; effective but expensive |
| Self-consistency / majority voting | Multiple samples; vote on answer; inference-time technique |
| Tool use (Python interpreter) | LLM writes code to compute answers; high quality on numeric problems |
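
Self-consistency is the simplest of these patterns to illustrate. The sketch below assumes the final answers have already been extracted from several sampled completions (with `None` standing for a sample that produced no parseable answer); the function name `majority_vote` is hypothetical. It returns the most frequent answer together with its share of the valid votes, which can serve as a rough confidence signal.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def majority_vote(answers: Iterable[Optional[str]]) -> Tuple[Optional[str], float]:
    """Pick the most common answer among samples and report its vote share."""
    # Drop samples with no parseable answer and trim stray whitespace.
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)


# Five sampled answers to the same problem, one of which failed to parse.
samples = ["42", "41", "42", None, " 42 "]
answer, share = majority_vote(samples)
print(answer, share)  # prints: 42 0.75
```

Voting costs one extra model call per sample, so the number of samples is a direct trade-off between accuracy and inference cost.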

Use Case Recommendations

| Use Case | Recommended Model |
|---|---|
| Math education / tutoring | Qwen2.5-Math-7B or NuminaMath 7B |
| Math olympiad / competition | Qwen3-32B Thinking or DeepSeek-R1-Distill-Qwen-Math-7B |
| Engineering calculation verification | Qwen2.5-Math with Python tool use |
| Scientific research math co-pilot | Qwen3-235B-A22B Thinking or DeepSeek-R1 |
| Quantitative finance research | Qwen3 Thinking with verifiable Python execution |
| Permissive commercial deployment | Qwen2.5-Math, NuminaMath, Mathstral (Apache 2.0); OpenMath2 (Llama 3.1 Community License, review its terms before deployment) |

Strategic Context

Three patterns shape the 2026 math LLM landscape. First, the reasoning-trace approach (R1-Distill family) outperforms pure math finetuning on competition-level math; reasoning plus tool use beats specialised math LLMs on most hard problems. Second, RLVR is the dominant training approach because math has natural verifiable signals. Third, the gap between open-weight and closed-model math is narrowing fast: Phi-4-reasoning-plus at 14B parameters comes within roughly 8 points of Claude 4.7 Opus on AIME 2024 (approximately 81 vs 89 percent) and roughly 6 points on MATH (approximately 89.7 vs 96 percent).

Brand Visibility Implications

Math AI is a high-traffic technical and education procurement category. AI assistant queries about "best math AI", "Qwen Math vs DeepSeek Math", "AIME LLM", and similar terms drive interest from education, research, and quantitative finance buyers. Brands selling math education AI, scientific research AI, and quantitative research tools face a large AI-mediated discovery surface in this category.

Methodology

Benchmark data compiled from primary model card disclosures, MATH and AIME 2024 evaluations, and the Hugging Face math model leaderboards through 23 May 2026. Updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on math AI queries across ChatGPT, Claude, Gemini, and Perplexity. For math education AI brands, scientific research AI vendors, and quantitative research tools, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.

Frequently Asked Questions

For pure math benchmarks (MATH), Qwen2.5-Math-72B leads the math-specialised models at approximately 89 percent, roughly level with Phi-4-reasoning-plus at approximately 89.7 percent. For AIME 2024 competition math, Phi-4-reasoning-plus 14B leads the open-weight field at approximately 81 percent, followed by Qwen3-32B (Thinking) at approximately 78.5 percent. For a small footprint, DeepSeek-R1-Distill-Qwen-Math-7B is the strongest 7B math choice.
For competition-level math, reasoning models (Qwen3 Thinking, DeepSeek-R1, Phi-4-reasoning-plus) tend to outperform specialised math LLMs by a meaningful margin because reasoning approaches generalise better. Specialised math LLMs remain useful for cost-efficient routine math, education, and engineering calculation workloads.
Reinforcement Learning with Verifiable Rewards uses rule-based reward signals (e.g., math correctness) instead of learned reward models. Math is the ideal RLVR domain because answers are verifiable. RLVR is the dominant training approach for math LLMs in 2026 and has driven much of the recent quality improvement.
Open-weight models do not yet match closed frontier models on the hardest benchmarks. GPT-5.5 with high reasoning scores approximately 94 percent on AIME 2024, versus approximately 81 percent for Phi-4-reasoning-plus and approximately 78.5 percent for Qwen3-32B (Thinking). The gap is closing but remains real for frontier-level math reasoning.
For numerical computation and engineering calculation, general LLM plus Python tool use is often the strongest approach because verifiable Python execution eliminates arithmetic errors. For symbolic math and proof reasoning, specialised math LLMs or reasoning LLMs are the better choice. Many production deployments use both: reasoning LLM for problem decomposition, Python for verification.
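
The "Python for verification" step can be illustrated with a narrow sketch: checking a model's claimed numeric result by re-evaluating the arithmetic exactly. This is an assumption-laden simplification of tool use, since real deployments usually run model-written programs in a sandbox. The names `safe_eval` and `verify_claim` are hypothetical, and only `+`, `-`, `*`, `/` on numeric literals are supported here.

```python
import ast
import operator
from fractions import Fraction

# Whitelisted operators; anything else (calls, names, attributes) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expr: str) -> Fraction:
    """Evaluate a plain arithmetic expression exactly, without using eval()."""
    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            # Going through str() keeps 0.1 as exactly 1/10.
            return Fraction(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression element")

    return walk(ast.parse(expr, mode="eval"))


def verify_claim(expression: str, claimed: str) -> bool:
    """Return True only if the claimed value equals the exact result."""
    try:
        return safe_eval(expression) == Fraction(claimed)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False


print(verify_claim("(3 + 4) * 12 / 8", "10.5"))  # prints: True
print(verify_claim("17 * 23", "381"))            # prints: False
```

Exact rational arithmetic avoids floating-point false negatives (for example, `0.1 + 0.2` compares equal to `0.3` here), which matters when the check is used to accept or reject a model's answer.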
