Distillation matured into a dominant capability-transfer technique in 2025-2026. DeepSeek released the R1-Distill variants (Qwen-1.5B through Llama-70B) alongside R1, transferring reasoning capability into smaller backbones. The Phi family relies heavily on distillation from larger frontier models. Skywork-OR1, Marco-o1, OpenThinker, and dozens of community projects use distillation to recover frontier-tier capability at production-deployable parameter sizes. This page consolidates the lineage and the methodology.
Key Findings
- The DeepSeek-R1-Distill family (six variants from Qwen 1.5B through Llama 70B) is the most-deployed distillation lineage on Hugging Face with cumulative downloads in the tens of millions, transferring frontier reasoning quality into consumer-hardware-deployable sizes.
- Phi-4 explicitly uses synthetic data distilled from larger frontier models as its primary training signal, demonstrating that distillation can serve as a primary pretraining strategy, not just a post-training step.
- Quality recovery via distillation typically achieves 70 to 90 percent of teacher quality on the student model's strengths; reasoning-specific distillation transfers better than general-capability distillation.
- Open-source community distillation is widespread: OpenThinker, Sky-T1, Bespoke-Stratos, and dozens of other projects distill open frontier reasoning models (R1, QwQ, Qwen3-Thinking) into smaller community-trained variants.
- Distillation enables on-device deployment of frontier-tier capability: DeepSeek-R1-Distill-Qwen-7B, at ~7B parameters, runs on a single consumer GPU and scores approximately 55.5 on AIME 2024, about 70 percent of R1's score.
DeepSeek-R1-Distill Family
| Distilled Model | Student Backbone | AIME 2024 Score | R1 Recovery |
|---|---|---|---|
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3 70B | ~70.0 | ~88% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 32B | ~72.6 | ~91% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 14B | ~69.7 | ~87% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 Math 7B | ~55.5 | ~70% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1 8B | ~50.4 | ~63% |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen 2.5 Math 1.5B | ~28.9 | ~36% |
| DeepSeek-R1 (teacher) | n/a | ~79.8 | 100% reference |
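
The R1 Recovery column above appears to be each student's AIME 2024 score divided by the teacher's score. The sketch below reproduces that arithmetic from the table values; the function and variable names are illustrative, and the rounding to whole percentages is an assumption about how the column was produced.

```python
# Scores copied from the table above (approximate AIME 2024 values).
TEACHER_AIME_2024 = 79.8  # DeepSeek-R1 (teacher)

student_scores = {
    "DeepSeek-R1-Distill-Llama-70B": 70.0,
    "DeepSeek-R1-Distill-Qwen-32B": 72.6,
    "DeepSeek-R1-Distill-Qwen-14B": 69.7,
    "DeepSeek-R1-Distill-Qwen-7B": 55.5,
    "DeepSeek-R1-Distill-Llama-8B": 50.4,
    "DeepSeek-R1-Distill-Qwen-1.5B": 28.9,
}


def recovery_percent(student_score, teacher_score=TEACHER_AIME_2024):
    """Student score expressed as a percentage of the teacher score."""
    return 100.0 * student_score / teacher_score


# Assumption: the table rounds recovery to the nearest whole percent.
recovery = {name: round(recovery_percent(score))
            for name, score in student_scores.items()}

for name, pct in recovery.items():
    print(f"{name}: ~{pct}%")
```

Note that recovery is not monotonic in parameter count: in this table the 32B Qwen student recovers more of the teacher's score than the 70B Llama student.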
Other Major Distillation Lineages
| Distilled Family | Teacher | Notes |
|---|---|---|
| OpenThinker-32B / 7B | R1 (with new reasoning data) | Open community distillation |
| Sky-T1-32B / 14B | QwQ-32B + DeepSeek-R1 | Open community distillation |
| Bespoke-Stratos-32B | R1 | Bespoke Labs distillation (built on the Berkeley Sky-T1 data pipeline) |
| Marco-o1-7B | Multiple teacher models | Alibaba production reasoning distill |
| Phi-4 (and family) | Synthetic from frontier models | Distillation-heavy pretraining |
| Tulu 3 (and finetuned applications) | Multiple teachers | Open recipe; uses distillation in SFT data |
| OpenMath-Mistral / OpenMath-Llama | Larger math models | Math-specific distillation |
| Codestral Distillation | Codestral | Code-specific distillation |
Distillation Methodology Patterns
| Pattern | Description |
|---|---|
| Soft-label distillation | Student matches teacher logits (KL divergence loss) |
| Hard-label SFT | Student trains on teacher-generated text (most common for open distillation) |
| Trace distillation (reasoning) | Student learns from teacher chain-of-thought traces; dominant pattern for R1-Distill family |
| Self-distillation | Student refines on its own filtered outputs |
| Synthetic-pretraining | Heavy use of teacher-generated text in pretraining mix (Phi-4 pattern) |
| RL with verifiable rewards | Combines distillation SFT with subsequent RL stage |
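
The first two rows of the table differ mainly in the training target: soft-label distillation matches the teacher's full next-token distribution, while hard-label SFT only sees the token the teacher actually emitted. The following is a minimal single-token sketch of both losses in plain Python, not the recipe of any model listed on this page; the toy logits, the temperature of 2.0, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_label_loss(teacher_token_id, student_logits):
    """Cross-entropy against the single token the teacher emitted."""
    q = softmax(student_logits)
    return -math.log(q[teacher_token_id])


# Toy three-token vocabulary (hypothetical values).
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]

print("soft-label KL:", soft_label_loss(teacher_logits, student_logits))
print("hard-label CE:", hard_label_loss(0, student_logits))
```

Soft-label training needs access to teacher logits, which is one plausible reason open distillation projects more often train on teacher-generated text instead: generated text is all that is needed for the hard-label and trace-distillation rows.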
Strategic Context
Three patterns shape distillation in 2026. First, distillation is the dominant capability-transfer technique: most strong small models in 2026 leverage distillation from larger teachers, directly or indirectly. Second, a distillation-friendly licence environment is critical: the MIT-licensed R1 weights enabled massive community distillation work, whereas closed-weight teacher models (GPT-5, Claude) cannot be distilled openly without raising licence concerns. Third, distillation works best on math and reasoning workloads: reasoning traces transfer more cleanly than general world knowledge.
Brand Visibility Implications
Distillation is a high-traffic AI-research category. AI assistant queries about "DeepSeek-R1 distill", "reasoning distillation", "small model finetuning", and similar terms drive interest from technical buyers and researchers. Brands selling AI training infrastructure, finetuning services, and AI research tools therefore have a large AI-mediated discovery surface in this category.
Methodology
Distillation lineage data compiled from primary model card disclosures, peer-reviewed publications, and community evaluation data through 23 May 2026. Updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on distillation queries across ChatGPT, Claude, Gemini, and Perplexity. For AI training infrastructure brands, finetuning service vendors, and AI research tools, the platform identifies the prompts driving research-traffic patterns and the gaps where new content unlocks share of voice.