What is the best open-weight reasoning model?

DeepSeek-R1, at approximately 671B MoE parameters, is the largest open-weight reasoning model, while Qwen3-235B-A22B (Thinking) posts the highest open-weight AIME 2024 score on the leaderboard below at approximately 83 percent. For smaller deployments, Qwen3-32B (Thinking) and DeepSeek-R1-Distill-Qwen-32B are the leading mid-size choices; both need quantization to fit on consumer GPUs.

Are open-weight reasoning models competitive with GPT-5?

Not yet, but the gap is closing. GPT-5.5 at high reasoning scores approximately 94 percent on AIME 2024 versus approximately 83 percent for Qwen3-235B-A22B (Thinking), a gap of about 11 points. The gap on GPQA-Diamond is wider at about 15 points (~85.7 vs ~71.1). On most reasoning tasks, the leading open-weight models cover the high-value middle of the difficulty distribution at a fraction of the cost.

Can I run a reasoning model on a single GPU?

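Yes, for the distilled and mid-size models. DeepSeek-R1-Distill-Qwen-32B and Qwen3-32B (Thinking) fit on a single 80 GB H100 in FP16: the weights take roughly 64 GB, which leaves limited room for KV cache. On a 24 GB RTX 4090, the 7B variants fit in FP16 (roughly 14 GB of weights), while the 14B variants need 8-bit or 4-bit quantization because their FP16 weights alone take roughly 28 GB. The 671B DeepSeek-R1 and 235B Qwen3 MoE require multi-GPU setups or aggressive quantization.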
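The sizing question above comes down to simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimator, not a capacity planner. The function names and the 10 percent headroom default are assumptions made for illustration, and the estimate ignores KV cache growth, activations, and runtime overhead, all of which matter more for long reasoning traces.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: KV cache, activations, and framework overhead are NOT modeled;
# they are only approximated by a flat headroom fraction.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float,
         headroom: float = 0.1) -> bool:
    """True if the weights fit while keeping `headroom` of VRAM free.

    The 10 percent default is an illustrative assumption, not a measured value.
    """
    usable_gb = vram_gb * (1.0 - headroom)
    return weight_memory_gb(params_billions, precision) <= usable_gb


if __name__ == "__main__":
    for label, size in [("7B", 7), ("14B", 14), ("32B", 32), ("671B", 671)]:
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(size, precision)
            print(f"{label:>5} {precision:>4}: {gb:7.1f} GB of weights")
```

Under these assumptions, a 32B model in FP16 passes the check for an 80 GB card, a 14B model in FP16 fails it for a 24 GB card, and the same 14B model passes once quantized to 8-bit.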

What is the difference between DeepSeek-R1 and QwQ?

DeepSeek-R1 is a 671B MoE model (about 37B active parameters) trained natively for reasoning; QwQ-32B is a smaller dense model from Alibaba focused on reasoning. R1 leads on most benchmarks but is harder to deploy. QwQ-32B is more accessible because it fits on a single high-memory GPU, and its preview release in late 2024 was among the earliest open-weight reasoning releases.

Do reasoning models cost more to run?

Yes, meaningfully. Reasoning models consume 5x to 20x more tokens per response than non-reasoning models because they generate an explicit chain of thought. Production deployments should budget for the longer outputs and may use mixed routing (reasoning models for hard queries, fast models for routine ones) to control cost.
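To make the routing trade-off concrete, here is a minimal sketch of the expected output tokens per query under mixed routing. The function name, the 500-token baseline, and the 25 percent hard-query share are hypothetical values chosen for illustration; only the 5x to 20x multiplier range comes from the text above.

```python
def blended_tokens_per_query(base_tokens: float, multiplier: float,
                             hard_fraction: float) -> float:
    """Expected output tokens per query under mixed routing.

    hard_fraction of queries go to the reasoning model, which is assumed to
    emit base_tokens * multiplier tokens; the rest go to a fast model that
    emits base_tokens.
    """
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    reasoning_tokens = base_tokens * multiplier
    return hard_fraction * reasoning_tokens + (1.0 - hard_fraction) * base_tokens


if __name__ == "__main__":
    base = 500  # hypothetical tokens per non-reasoning response
    for multiplier in (5, 10, 20):
        everything = blended_tokens_per_query(base, multiplier, 1.0)
        routed = blended_tokens_per_query(base, multiplier, 0.25)
        print(f"{multiplier:>2}x: all-reasoning={everything:.0f}, "
              f"25% routed={routed:.0f} tokens/query")
```

Multiplying the result by a per-token price gives a rough cost comparison. The sketch assumes the router classifies queries perfectly, which a real deployment would need to measure.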

Best Open-Weight Reasoning Models 2026

Reasoning models with test-time compute scaling became the most important AI capability frontier in 2025-2026. Open-weight reasoning took off with DeepSeek-R1 and the R1-distill family in January 2025, following late-2024 previews such as QwQ-32B-Preview and Marco-o1, and accelerated with Skywork-OR1 and Qwen3-Thinking through 2025-2026. This page consolidates the leaderboard, the benchmarks, and the deployment guidance.

Key Findings

DeepSeek-R1 remains the largest open-weight reasoning model at approximately 671B MoE parameters; the R1-distill family (distilled into Qwen and Llama backbones at 1.5B to 70B sizes) made reasoning accessible on consumer hardware.
Qwen3-Thinking variants (4B, 8B, 14B, 32B, 235B-A22B MoE), released in April 2025, lead the mid-size open-weight reasoning leaderboard with strong AIME, GPQA-Diamond, and LiveCodeBench scores.
QwQ-32B from Alibaba (preview released in late 2024, full release in March 2025) remains a popular 32B reasoning model with competitive AIME performance at consumer GPU scale.
Skywork-OR1 (Open Reasoner-1) from Kunlun is the strongest fully-open reasoning model with open weights, open recipe, and open RL training data.
Marco-o1 from Alibaba is the most-used reasoning model in production agentic systems where the open weights and explicit reasoning traces support custom guardrail integration.

Open-Weight Reasoning Leaderboard (May 2026)

Model	Parameters	AIME 2024 (%)	GPQA-Diamond (%)	License
DeepSeek-R1	~671B MoE (~37B active)	~79.8	~71.5	MIT
DeepSeek-R1-Distill-Llama-70B	~70B	~70.0	~65.2	MIT
DeepSeek-R1-Distill-Qwen-32B	~32B	~72.6	~62.1	MIT
DeepSeek-R1-Distill-Qwen-14B	~14B	~69.7	~59.1	MIT
DeepSeek-R1-Distill-Qwen-7B	~7B	~55.5	~49.1	MIT
Qwen3-235B-A22B (Thinking)	~235B MoE (~22B active)	~83.0	~71.1	Apache 2.0
Qwen3-32B (Thinking)	~32B	~78.5	~66.4	Apache 2.0
Qwen3-14B (Thinking)	~14B	~71.2	~58.0	Apache 2.0
Qwen3-8B (Thinking)	~8B	~62.4	~52.1	Apache 2.0
QwQ-32B	~32B	~63.6	~54.5	Apache 2.0
Marco-o1	~7B	~51.5	~46.2	Apache 2.0
Skywork-OR1-32B	~32B	~77.1	~64.8	Apache 2.0
OpenThinker-32B	~32B	~66.7	~58.0	Apache 2.0
Sky-T1-32B	~32B	~57.0	~50.5	Apache 2.0

Closed-Model Reasoning Reference

Model	AIME 2024 (%)	GPQA-Diamond (%)
GPT-5.5 (high reasoning)	~94.0	~85.7
Claude 4.7 Opus	~89.0	~88.0
Gemini 3.1 Pro Deep Think	~91.8	~83.1
Grok 4	~85.5	~79.8

Use Case Recommendations

Use Case	Recommended Model
Top reasoning quality	DeepSeek-R1 or Qwen3-235B-A22B (Thinking)
Reasoning on consumer GPU (24-48 GB, 4-bit or 8-bit quantized)	DeepSeek-R1-Distill-Qwen-32B or Qwen3-32B (Thinking)
Math contest preparation	Qwen3-32B (Thinking) or Skywork-OR1
Code with reasoning	Qwen3-32B (Thinking) or DeepSeek-R1-Distill-Qwen-32B
Agentic systems with explicit traces	Marco-o1 or Qwen3 Thinking
Edge / on-device reasoning	DeepSeek-R1-Distill-Qwen-7B or Qwen3-8B (Thinking)
Research and reproducibility	Skywork-OR1 (full open recipe)

Inference Considerations

Three considerations shape deployment. First, reasoning models consume substantially more tokens per response than non-reasoning models, typically 5x to 20x, so budget compute accordingly. Second, test-time compute scaling is real: raising the maximum reasoning trace length materially improves performance up to a model-specific saturation point. Third, the reasoning trace is often the most valuable output: agentic systems and tool-use frameworks benefit from explicit trace logging because it provides a natural audit and guardrail surface.
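The trace-logging point can be sketched as a small parser that separates the reasoning trace from the final answer so each can be logged or checked on its own. This assumes the model wraps its reasoning in `<think>...</think>` tags, as DeepSeek-R1 and Qwen3 Thinking outputs do; other models may use a different delimiter, and `split_trace` and `ParsedResponse` are hypothetical names introduced here.

```python
import re
from typing import NamedTuple

# Assumption: reasoning is delimited by <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


class ParsedResponse(NamedTuple):
    trace: str   # reasoning text, for audit logs or guardrail checks
    answer: str  # user-facing text with the reasoning removed


def split_trace(raw: str) -> ParsedResponse:
    """Split raw model output into reasoning trace and final answer."""
    # Some chat templates pre-fill the opening tag, so the generated text
    # contains only the closing tag; restore the opener in that case.
    if "<think>" not in raw and "</think>" in raw:
        raw = "<think>" + raw
    traces = [block.strip() for block in THINK_BLOCK.findall(raw)]
    answer = THINK_BLOCK.sub("", raw).strip()
    return ParsedResponse(trace="\n\n".join(traces), answer=answer)


if __name__ == "__main__":
    sample = "<think>2 + 2 is 4, so double it.</think>\nThe answer is 8."
    parsed = split_trace(sample)
    print("TRACE :", parsed.trace)
    print("ANSWER:", parsed.answer)
```

Keeping the trace separate lets a guardrail inspect the reasoning before the answer is shown, and lets logs store traces under a different retention policy than answers. An unterminated trace (for example, output cut off at the token limit) is left in the answer by this sketch and would need its own handling.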

Brand Visibility Implications

Reasoning model selection is the most-watched AI capability decision in 2026 for technical buyers. AI assistant queries such as "best reasoning model open source", "DeepSeek-R1 vs QwQ", and "Qwen3 Thinking adoption" drive direct production decisions. Brands selling AI evaluation tools, agentic infrastructure, RL fine-tuning, and reasoning-specific tooling have a large AI-mediated discovery surface in this category.

Methodology

Benchmark data compiled from primary model card disclosures, the Hugging Face Open LLM Leaderboard, and peer-reviewed reasoning benchmark publications through 23 May 2026. Updated quarterly with new reasoning model releases.

How Presenc AI Helps

Presenc AI monitors brand visibility on reasoning model queries across ChatGPT, Claude, Gemini, and Perplexity. For AI evaluation brands, agentic infrastructure vendors, RL fine-tuning services, and reasoning-specific tooling, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.