Distillation Lineage Tracker 2026

Open-weight distillation lineage 2026: DeepSeek-R1 distills (Qwen and Llama backbones), Llama distill descendants, Phi distillation, Skywork-OR1 distills. Quality recovery, methodology, deployment patterns.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Distillation matured into a dominant capability-transfer technique over 2025 and 2026. DeepSeek released six R1-Distill variants (Qwen 1.5B through Llama 70B) alongside R1, transferring reasoning capability into smaller backbones. The Phi family relies heavily on distillation from larger frontier models. Skywork-OR1, Marco-o1, OpenThinker, and dozens of community projects use distillation to recover much of a frontier model's capability at parameter sizes that are practical to deploy. This page consolidates the lineage and the methodology.

Key Findings

  1. The DeepSeek-R1-Distill family (six variants from Qwen 1.5B through Llama 70B) is the most-deployed distillation lineage on Hugging Face with cumulative downloads in the tens of millions, transferring frontier reasoning quality into consumer-hardware-deployable sizes.
  2. Phi-4 explicitly uses synthetic data distilled from larger frontier models as a primary training signal, demonstrating that distillation can serve as a pretraining strategy and not only a post-training step.
  3. Quality recovery via distillation typically reaches 70 to 90 percent of teacher quality on the student model’s strengths; reasoning-specific distillation transfers better than general-capability distillation.
  4. Open-source community distillation is widespread: OpenThinker, Sky-T1, Bespoke-Stratos, and dozens of other projects distill open frontier reasoning models (R1, QwQ, Qwen3-Thinking) into smaller variants.
  5. Distillation enables on-device deployment of frontier-tier capability: DeepSeek-R1-Distill-Qwen-7B, at ~7B parameters, runs on a single consumer GPU and scores approximately 55.5 on AIME 2024, about 70 percent of R1’s score.
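
To make the single-consumer-GPU claim concrete, here is a rough weights-only memory estimate for a 7B-parameter model at common precisions. This is a sketch under stated assumptions: the parameter count is rounded to 7e9, the function name `weight_memory_gib` is illustrative, and the estimate ignores KV cache, activations, and runtime overhead, so real memory use will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the model weights only, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Assumption: parameter count rounded to 7 billion for illustration.
PARAMS_7B = 7e9

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = weight_memory_gib(PARAMS_7B, bits)
    print(f"{label:>9}: {gib:.1f} GiB (weights only)")
```

At 16-bit precision the weights alone come to roughly 13 GiB, which fits on a 16 GB or 24 GB consumer card; 4-bit quantization brings that down to a little over 3 GiB.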

DeepSeek-R1-Distill Family

| Distilled Model | Student Backbone | AIME 2024 Score | R1 Recovery |
|---|---|---|---|
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3 70B | ~70.0 | ~88% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 32B | ~72.6 | ~91% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 14B | ~69.7 | ~87% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 Math 7B | ~55.5 | ~70% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1 8B | ~50.4 | ~63% |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen 2.5 Math 1.5B | ~28.9 | ~36% |
| DeepSeek-R1 (teacher) | n/a | ~79.8 | 100% reference |
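
The R1 Recovery column appears to be the student's AIME 2024 score divided by the teacher's score. The short sketch below recomputes it from the scores in the table; the dictionary and the function name `recovery_percent` are illustrative, and the rounding to whole percentages is an assumption about how the column was produced.

```python
# AIME 2024 scores as listed in the table above.
TEACHER_AIME_2024 = 79.8  # DeepSeek-R1 (teacher)

student_scores = {
    "DeepSeek-R1-Distill-Llama-70B": 70.0,
    "DeepSeek-R1-Distill-Qwen-32B": 72.6,
    "DeepSeek-R1-Distill-Qwen-14B": 69.7,
    "DeepSeek-R1-Distill-Qwen-7B": 55.5,
    "DeepSeek-R1-Distill-Llama-8B": 50.4,
    "DeepSeek-R1-Distill-Qwen-1.5B": 28.9,
}


def recovery_percent(student_score: float, teacher_score: float) -> float:
    """Student score expressed as a percentage of the teacher score."""
    return 100.0 * student_score / teacher_score


recovery = {
    name: round(recovery_percent(score, TEACHER_AIME_2024))
    for name, score in student_scores.items()
}

for name, pct in recovery.items():
    print(f"{name}: ~{pct}% of R1")
```

Recovery is a ratio on a single benchmark, so it says nothing about how a student compares with the teacher on tasks outside AIME 2024.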

Other Major Distillation Lineages

| Distilled Family | Teacher | Notes |
|---|---|---|
| OpenThinker-32B / 7B | R1 (with new reasoning data) | Open community distillation |
| Sky-T1-32B / 14B | QwQ-32B + DeepSeek-R1 | Open community distillation |
| Bespoke-Stratos-32B | R1 | Bespoke Labs distillation (reuses Berkeley's Sky-T1 pipeline) |
| Marco-o1-7B | Multiple teacher models | Alibaba production reasoning distill |
| Phi-4 (and family) | Synthetic from frontier models | Distillation-heavy pretraining |
| Tulu 3 (and finetuned applications) | Multiple teachers | Open recipe; uses distillation in SFT data |
| OpenMath-Mistral / OpenMath-Llama | Larger math models | Math-specific distillation |
| Codestral Distillation | Codestral | Code-specific distillation |

Distillation Methodology Patterns

| Pattern | Description |
|---|---|
| Soft-label distillation | Student matches teacher logits (KL divergence loss) |
| Hard-label SFT | Student trains on teacher-generated text (most common for open distillation) |
| Trace distillation (reasoning) | Student learns from teacher chain-of-thought traces; dominant pattern for R1-Distill family |
| Self-distillation | Student refines on its own filtered outputs |
| Synthetic-pretraining | Heavy use of teacher-generated text in pretraining mix (Phi-4 pattern) |
| RL with verifiable rewards | Combines distillation SFT with subsequent RL stage |
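
The difference between the first two patterns is easiest to see in the loss function. The sketch below uses only the standard library and toy logits over a four-token vocabulary. It is not any particular lab's training code: the function names, the temperature value, and the temperature-squared scaling (a common convention in the knowledge-distillation literature) are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label distillation: match the teacher's full distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    # Temperature-squared scaling is an assumed convention, not a requirement.
    return temperature ** 2 * kl_divergence(p_teacher, q_student)


def hard_label_loss(student_logits, teacher_token_index):
    """Hard-label SFT: cross-entropy on the token the teacher emitted."""
    q_student = softmax(student_logits)
    return -math.log(q_student[teacher_token_index])


# Toy logits over a 4-token vocabulary (illustrative values only).
teacher_logits = [4.0, 2.0, 0.5, -1.0]
student_logits = [2.5, 2.0, 1.0, 0.0]

print("soft-label loss:", soft_label_loss(teacher_logits, student_logits))
print("hard-label loss:", hard_label_loss(student_logits, teacher_token_index=0))
```

Soft-label distillation needs access to the teacher's logits, whereas hard-label SFT only needs the teacher's generated text. That may be part of why hard-label SFT is the more common pattern for open distillation.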

Strategic Context

Three patterns shape distillation in 2026. First, distillation is the dominant capability-transfer technique: most strong small models in 2026 leverage distillation from larger teachers, either directly or indirectly. Second, a distillation-friendly licence environment is critical: the MIT-licensed R1 weights enabled massive community distillation work, whereas closed-weight teacher models (GPT-5, Claude) cannot be distilled openly without raising licence concerns. Third, distillation works best on math and reasoning workloads: reasoning traces transfer more cleanly than general world knowledge.

Brand Visibility Implications

Distillation is a high-traffic AI-research category. AI assistant queries about "DeepSeek-R1 distill", "reasoning distillation", "small model finetuning", and similar terms drive interest from technical buyers and researchers. Brands selling AI training infrastructure, finetuning services, and AI research tools have a large AI-mediated discovery surface in this category.

Methodology

Distillation lineage data compiled from primary model card disclosures, peer-reviewed publications, and community evaluation data through 23 May 2026. Updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on distillation queries across ChatGPT, Claude, Gemini, and Perplexity. For AI training infrastructure brands, finetuning service vendors, and AI research tools, the platform identifies the prompts driving research-traffic patterns and the gaps where new content unlocks share of voice.

Frequently Asked Questions

What is the DeepSeek-R1-Distill family?
Six distilled variants of DeepSeek-R1, released alongside the teacher in January 2025: Qwen 1.5B, Qwen 7B, Llama 8B, Qwen 14B, Qwen 32B, and Llama 70B. The family transfers R1’s reasoning capability into smaller backbones, so that frontier-tier reasoning quality can run on consumer hardware.

How much teacher quality does distillation recover?
Typically 70 to 90 percent of teacher quality on the student backbone’s strengths. DeepSeek-R1-Distill-Qwen-32B recovers approximately 91 percent of R1’s AIME 2024 performance. Larger distilled models generally recover more, although the Llama 70B variant (~88 percent) trails the Qwen 32B variant; the 1.5B variant recovers approximately 36 percent.

Is distillation the same as supervised fine-tuning (SFT)?
Related but distinct. SFT is supervised learning on instruction-response pairs from any source. Distillation specifically uses teacher-generated data with the goal of transferring teacher capability to the student. Trace distillation, used heavily in R1-Distill, trains the student on teacher chain-of-thought traces and is the dominant approach for reasoning transfer.

Can you distil from closed-weight models?
Legally complex. OpenAI, Anthropic, and Google terms of service generally prohibit using their outputs to train competing models. Distillation from open-weight teachers (DeepSeek-R1, Qwen3, Llama 4) is generally permitted, subject to the conditions of each model’s licence. Some research labs do distil from closed models, but the legal status is uncertain.

How does Phi-4 use distillation?
Phi-4 uses synthetic data distilled from larger frontier models as a primary pretraining signal, not just in post-training. This "textbook-quality data" thesis demonstrates distillation working at the pretraining stage to produce strong small models. Approximately 40 to 60 percent of Phi-4 pretraining tokens are synthetic per Microsoft Research disclosures.
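
The FAQ above distinguishes trace distillation from generic SFT. A minimal sketch of how teacher traces might be turned into SFT records follows. Everything here is hypothetical: the `TeacherSample` structure, the exact-match correctness filter, and the `<think>` wrapper format are illustrative assumptions, not DeepSeek's published pipeline.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    """One teacher generation (hypothetical structure for illustration)."""
    prompt: str
    reasoning: str  # the teacher's chain-of-thought trace
    answer: str     # the teacher's final answer


def is_correct(sample: TeacherSample, reference_answer: str) -> bool:
    """Assumed filter: keep only traces whose final answer matches a reference."""
    return sample.answer.strip() == reference_answer.strip()


def build_sft_record(sample: TeacherSample) -> dict:
    """Pack trace and answer into one completion for the student to imitate."""
    completion = f"<think>\n{sample.reasoning}\n</think>\n{sample.answer}"
    return {"prompt": sample.prompt, "completion": completion}


def build_dataset(samples, references):
    """Keep verified traces only, then format them as SFT records."""
    return [
        build_sft_record(s)
        for s in samples
        if s.prompt in references and is_correct(s, references[s.prompt])
    ]


samples = [
    TeacherSample("What is 12 * 12?", "12 * 12 = 12 * 10 + 12 * 2 = 144.", "144"),
    TeacherSample("What is 7 + 8?", "7 + 8 = 16.", "16"),  # wrong, filtered out
]
references = {"What is 12 * 12?": "144", "What is 7 + 8?": "15"}

dataset = build_dataset(samples, references)
print(len(dataset), "record(s) kept")
print(dataset[0]["completion"])
```

Unlike plain hard-label SFT, the student is trained to reproduce the intermediate reasoning as well as the final answer. Filtering by a verifiable answer is one simple way to keep incorrect traces out of the training set.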
