Reasoning models with test-time compute scaling became the most important AI capability frontier in 2025-2026. Open-weight reasoning went mainstream with DeepSeek-R1 and its R1-distill family in January 2025, following late-2024 previews such as QwQ-32B-Preview and Marco-o1, and accelerated with QwQ-32B, Skywork-OR1, and Qwen3-Thinking through 2025-2026. This page consolidates the leaderboard, the benchmarks, and the deployment guidance.
Key Findings
- DeepSeek-R1 remains the largest open-weight reasoning model at approximately 671B total parameters (MoE); the R1-distill family (distilled into Qwen and Llama backbones at 1.5B to 70B sizes) made reasoning accessible on consumer hardware.
- Qwen3-Thinking variants (4B, 8B, 14B, 32B, 235B-A22B MoE), first released in April 2025, lead the mid-size open-weight reasoning leaderboard with strong AIME, GPQA-Diamond, and LiveCodeBench scores.
- QwQ-32B from Alibaba (preview released in late 2024, full release in March 2025) remains a popular 32B reasoning model with competitive AIME performance at consumer GPU scale.
- Skywork-OR1 (Open Reasoner-1) from Kunlun is the strongest fully-open reasoning model with open weights, open recipe, and open RL training data.
- Marco-o1 from Alibaba is the most-used reasoning model in production agentic systems, where its open weights and explicit reasoning traces support custom guardrail integration.
Open-Weight Reasoning Leaderboard (May 2026)
| Model | Parameters | AIME 2024 | GPQA-Diamond | License |
|---|---|---|---|---|
| DeepSeek-R1 | ~671B MoE (~37B active) | ~79.8 | ~71.5 | MIT |
| DeepSeek-R1-Distill-Llama-70B | ~70B | ~70.0 | ~65.2 | MIT |
| DeepSeek-R1-Distill-Qwen-32B | ~32B | ~72.6 | ~62.1 | MIT |
| DeepSeek-R1-Distill-Qwen-14B | ~14B | ~69.7 | ~59.1 | MIT |
| DeepSeek-R1-Distill-Qwen-7B | ~7B | ~55.5 | ~49.1 | MIT |
| Qwen3-235B-A22B (Thinking) | ~235B MoE (~22B active) | ~83.0 | ~71.1 | Apache 2.0 |
| Qwen3-32B (Thinking) | ~32B | ~78.5 | ~66.4 | Apache 2.0 |
| Qwen3-14B (Thinking) | ~14B | ~71.2 | ~58.0 | Apache 2.0 |
| Qwen3-8B (Thinking) | ~8B | ~62.4 | ~52.1 | Apache 2.0 |
| QwQ-32B | ~32B | ~63.6 | ~54.5 | Apache 2.0 |
| Marco-o1 | ~7B | ~51.5 | ~46.2 | Apache 2.0 |
| Skywork-OR1-32B | ~32B | ~77.1 | ~64.8 | Apache 2.0 |
| OpenThinker-32B | ~32B | ~66.7 | ~58.0 | Apache 2.0 |
| Sky-T1-32B | ~32B | ~57.0 | ~50.5 | Apache 2.0 |
Closed-Model Reasoning Reference
| Model | AIME 2024 | GPQA-Diamond |
|---|---|---|
| GPT-5.5 (high reasoning) | ~94.0 | ~85.7 |
| Claude 4.7 Opus | ~89.0 | ~88.0 |
| Gemini 3.1 Pro Deep Think | ~91.8 | ~83.1 |
| Grok 4 | ~85.5 | ~79.8 |
Use Case Recommendations
| Use Case | Recommended Model |
|---|---|
| Top reasoning quality | DeepSeek-R1 or Qwen3-235B-A22B (Thinking) |
| Reasoning on consumer GPU (24-48 GB) | DeepSeek-R1-Distill-Qwen-32B or Qwen3-32B (Thinking) |
| Math contest preparation | Qwen3-32B (Thinking) or Skywork-OR1 |
| Code with reasoning | Qwen3-32B (Thinking) or DeepSeek-R1-Distill-Qwen-32B |
| Agentic systems with explicit traces | Marco-o1 or Qwen3 Thinking |
| Edge / on-device reasoning | DeepSeek-R1-Distill-Qwen-7B or Qwen3-8B (Thinking) |
| Research and reproducibility | Skywork-OR1 (full open recipe) |
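The consumer GPU row (24-48 GB) depends heavily on quantization. As a rough sketch, weight memory is approximately the parameter count times the bytes per parameter. The helper names and the 4 GB headroom figure below are illustrative assumptions, and the estimate covers weights only: KV cache and activations add to the total and grow with reasoning trace length.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_param: int, vram_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in vram_gb.

    headroom_gb is a placeholder for KV cache and runtime overhead;
    the real figure depends on context length and the serving stack.
    """
    return weight_memory_gb(params_billion, bits_per_param) + headroom_gb <= vram_gb


# A ~32B model at common weight precisions, checked against 24 GB and 48 GB.
for bits in (16, 8, 4):
    print(bits, weight_memory_gb(32, bits), fits(32, bits, 24), fits(32, bits, 48))
```

Under these assumptions a ~32B model needs roughly 4-bit weights to fit on a 24 GB card, while a 48 GB budget leaves room for 8-bit weights.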
Inference Considerations
Three deployment patterns stand out. First, reasoning models consume substantially more tokens per response than non-reasoning models, typically 5x to 20x; budget compute accordingly. Second, test-time compute scaling is real: increasing the maximum reasoning trace length materially improves performance up to a model-specific saturation point. Third, the reasoning trace is often the most valuable output: agentic systems and tool-use frameworks benefit from explicit trace logging because it provides a natural audit and guardrail surface.
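The first and third points can be sketched in a few lines. The sketch assumes the model delimits its reasoning with `<think>...</think>` tags, as DeepSeek-R1 and Qwen3 in thinking mode do; some chat templates emit only the closing tag, and other models may use a different delimiter entirely. The function names and the 400-token baseline are illustrative, not taken from any library.

```python
from typing import NamedTuple

# Multiplier range quoted in the text: reasoning responses typically use
# 5x to 20x the tokens of a comparable non-reasoning response.
LOW_MULTIPLIER, HIGH_MULTIPLIER = 5, 20


def reasoning_token_budget(baseline_tokens: int) -> tuple[int, int]:
    """Return a (low, high) token estimate for a reasoning response."""
    return baseline_tokens * LOW_MULTIPLIER, baseline_tokens * HIGH_MULTIPLIER


class ParsedResponse(NamedTuple):
    trace: str   # reasoning text, to be logged for audit and guardrails
    answer: str  # final user-facing answer


def split_trace(raw: str) -> ParsedResponse:
    """Separate the reasoning trace from the final answer.

    Assumes the trace ends at the first "</think>". The opening tag is
    optional because some templates place it in the prompt instead.
    """
    head, closing, tail = raw.partition("</think>")
    if not closing:
        # No trace delimiter found: treat the whole output as the answer.
        return ParsedResponse(trace="", answer=raw.strip())
    # Keep only the text after the last "<think>", if one is present.
    _, _, trace = head.rpartition("<think>")
    return ParsedResponse(trace=trace.strip(), answer=tail.strip())


raw = "<think>2 + 2 is 4; double-check: yes.</think>\nThe answer is 4."
parsed = split_trace(raw)
print(reasoning_token_budget(400))
print("trace:", parsed.trace)
print("answer:", parsed.answer)
```

Keeping the trace and the answer as separate fields lets a logging or guardrail layer inspect the reasoning without changing what is shown to the end user.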
Brand Visibility Implications
Reasoning model selection is the most-watched AI capability decision in 2026 for technical buyers. AI assistant queries such as "best reasoning model open source", "DeepSeek-R1 vs QwQ", and "Qwen3 Thinking adoption" feed directly into production decisions. Brands selling AI evaluation tools, agentic infrastructure, RL fine-tuning, and reasoning-specific tooling have a large AI-mediated discovery surface in this category.
Methodology
Benchmark data compiled from primary model card disclosures, the Hugging Face Open LLM Leaderboard, and peer-reviewed reasoning benchmark publications through 23 May 2026. Updated quarterly with new reasoning model releases.
How Presenc AI Helps
Presenc AI monitors brand visibility on reasoning model queries across ChatGPT, Claude, Gemini, and Perplexity. For AI evaluation brands, agentic infrastructure vendors, RL fine-tuning services, and reasoning-specific tooling, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.