What is the R1-Distill family?

Six distilled variants of DeepSeek-R1 released alongside the teacher in January 2025: Qwen 1.5B, Qwen 7B, Llama 8B, Qwen 14B, Qwen 32B, and Llama 70B. The family transfers R1’s reasoning capability into smaller backbones, enabling deployment of frontier-tier reasoning quality on consumer hardware.

How well does distillation transfer quality?

Typically 70 to 90 percent of teacher quality on the student backbone’s strengths, with the smallest students falling below that range. DeepSeek-R1-Distill-Qwen-32B recovers approximately 91 percent of R1’s AIME 2024 performance. Larger distilled models generally recover more, though not monotonically: the 70B Llama variant recovers approximately 88 percent, and the 1.5B variant approximately 36 percent.

Is distillation just supervised fine-tuning?

Related but distinct. SFT is supervised learning on instruction-response pairs from any source. Distillation uses teacher-generated data with the explicit goal of transferring teacher capability to the student. Trace distillation (used heavily in R1-Distill) trains the student on the teacher’s chain-of-thought traces rather than on final answers alone, and is the most effective of these approaches for reasoning transfer.
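A minimal sketch of how trace-distillation training examples might be assembled from teacher outputs. The `TeacherSample` type, the `build_trace_sft_examples` helper, the exact-match filter, and the `<think>` wrapper used for the completion are illustrative assumptions, not the published R1-Distill pipeline.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    """One teacher generation (hypothetical structure for illustration)."""
    prompt: str
    reasoning: str   # teacher chain-of-thought trace
    answer: str      # teacher final answer
    reference: str   # known-correct answer, used only for filtering


def build_trace_sft_examples(samples):
    """Keep traces whose final answer matches the reference,
    then format each as a prompt/completion pair for SFT."""
    examples = []
    for s in samples:
        # Drop traces that end in a wrong answer (simple exact-match filter).
        if s.answer.strip() != s.reference.strip():
            continue
        # The student is trained to reproduce the full trace, not just the answer.
        completion = f"<think>\n{s.reasoning.strip()}\n</think>\n{s.answer.strip()}"
        examples.append({"prompt": s.prompt, "completion": completion})
    return examples
```

Plain hard-label SFT would keep only the answer in the completion; including the reasoning trace is what distinguishes trace distillation in this sketch.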

Can I distill from closed models like GPT-5?

Legally complex. OpenAI, Anthropic, and Google terms of service generally prohibit using their outputs to train competing models. Distillation from open-weight teachers (DeepSeek-R1, Qwen3, Llama 4) is generally permitted, subject to the conditions of each model’s licence. Some research labs do distill from closed models, but the legal status is uncertain.

How does Phi-4 use distillation?

Phi-4 uses synthetic data distilled from larger frontier models as a primary pretraining signal, not just post-training. This "textbook-quality data" thesis demonstrates distillation working at the pretraining stage to produce strong small models. Approximately 40 to 60 percent of Phi-4 pretraining tokens are synthetic per Microsoft Research disclosures.

Distillation Lineage Tracker 2026

Distillation matured into a dominant capability-transfer technique in 2025-2026. DeepSeek released R1-Distill variants (Qwen-1.5B through Llama-70B) alongside R1, transferring reasoning capability into smaller backbones. The Phi family relies heavily on distillation from larger frontier models. Skywork-OR1, Marco-o1, OpenThinker, and dozens of community projects use distillation to recover frontier-tier capability at production-deployable parameter sizes. This page consolidates the lineage and the methodology.

Key Findings

The DeepSeek-R1-Distill family (six variants from Qwen 1.5B through Llama 70B) is the most-deployed distillation lineage on Hugging Face with cumulative downloads in the tens of millions, transferring frontier reasoning quality into consumer-hardware-deployable sizes.
Phi-4 explicitly uses synthetic data distilled from larger frontier models as a primary pretraining signal, demonstrating that distillation works at the pretraining stage and not just in post-training.
Quality recovery via distillation typically achieves 70 to 90 percent of teacher quality on the student model’s strengths; reasoning-specific distillation transfers better than general-capability distillation.
Open-source community distillation: OpenThinker, Sky-T1, Bespoke-Stratos, and dozens of other projects distill open frontier reasoning models (R1, QwQ, Qwen3-Thinking) into smaller variants.
Distillation enables on-device deployment of frontier-tier capability: DeepSeek-R1-Distill-Qwen-7B at ~7B parameters runs on a single consumer GPU and recovers approximately 70 percent of R1’s AIME 2024 performance (a score of ~55.5 against R1’s ~79.8).
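As a rough back-of-the-envelope check on the single-consumer-GPU claim, the sketch below estimates memory for the weights alone at a few precisions. The `weight_memory_gb` helper and the choice of 16-, 8-, and 4-bit weights are assumptions for illustration; real usage is higher once the KV cache, activations, and runtime overhead are included.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_7B = 7e9  # the "~7B parameters" figure quoted above

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(PARAMS_7B, bits):.1f} GB")
```

Under these assumptions the weights need about 14 GB at 16-bit and about 3.5 GB at 4-bit, which is why quantized 7B models fit on common consumer cards.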

DeepSeek-R1-Distill Family

Distilled Model	Student Backbone	AIME 2024 Score (pass@1, %)	R1 Recovery (student score / R1 score)
DeepSeek-R1-Distill-Llama-70B	Llama 3.3 70B	~70.0	~88%
DeepSeek-R1-Distill-Qwen-32B	Qwen 2.5 32B	~72.6	~91%
DeepSeek-R1-Distill-Qwen-14B	Qwen 2.5 14B	~69.7	~87%
DeepSeek-R1-Distill-Qwen-7B	Qwen 2.5 Math 7B	~55.5	~70%
DeepSeek-R1-Distill-Llama-8B	Llama 3.1 8B	~50.4	~63%
DeepSeek-R1-Distill-Qwen-1.5B	Qwen 2.5 Math 1.5B	~28.9	~36%
DeepSeek-R1 (teacher)	n/a	~79.8	100% reference
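The R1 Recovery column is each student's AIME 2024 score divided by the teacher's score. A short sketch that reproduces the column from the scores in the table; the `recovery_pct` helper and the rounding to whole percentages are assumptions made here.

```python
TEACHER_AIME = 79.8  # DeepSeek-R1 AIME 2024 score from the table above

STUDENT_AIME = {
    "DeepSeek-R1-Distill-Llama-70B": 70.0,
    "DeepSeek-R1-Distill-Qwen-32B": 72.6,
    "DeepSeek-R1-Distill-Qwen-14B": 69.7,
    "DeepSeek-R1-Distill-Qwen-7B": 55.5,
    "DeepSeek-R1-Distill-Llama-8B": 50.4,
    "DeepSeek-R1-Distill-Qwen-1.5B": 28.9,
}


def recovery_pct(student_score: float, teacher_score: float = TEACHER_AIME) -> int:
    """Student score as a whole-number percentage of the teacher score."""
    return round(100 * student_score / teacher_score)


for name, score in STUDENT_AIME.items():
    print(f"{name}: ~{recovery_pct(score)}%")
```

Because recovery is a ratio of benchmark scores, it describes AIME 2024 only and says nothing about how the student compares with the teacher on other workloads.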

Other Major Distillation Lineages

Distilled Family	Teacher	Notes
OpenThinker-32B / 7B	R1 (with new reasoning data)	Open community distillation
Sky-T1-32B / 14B	QwQ-32B + DeepSeek-R1	Open community distillation
Bespoke-Stratos-32B	R1	Bespoke Labs distillation (built on the Berkeley Sky-T1 recipe)
Marco-o1-7B	Multiple teacher models	Alibaba production reasoning distill
Phi-4 (and family)	Synthetic from frontier models	Distillation-heavy pretraining
Tulu 3 (and finetuned applications)	Multiple teachers	Open recipe; uses distillation in SFT data
OpenMath-Mistral / OpenMath-Llama	Larger math models	Math-specific distillation
Codestral Distillation	Codestral	Code-specific distillation

Distillation Methodology Patterns

Pattern	Description
Soft-label distillation	Student matches teacher logits (KL divergence loss)
Hard-label SFT	Student trains on teacher-generated text (most common for open distillation)
Trace distillation (reasoning)	Student learns from teacher chain-of-thought traces; dominant pattern for R1-Distill family
Self-distillation	Student refines on its own filtered outputs
Synthetic-pretraining	Heavy use of teacher-generated text in pretraining mix (Phi-4 pattern)
RL with verifiable rewards	Combines distillation SFT with subsequent RL stage
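The soft-label row in the table above can be illustrated with a dependency-free sketch of the KL divergence between temperature-softened teacher and student distributions for a single token position. The function names and the default temperature of 2.0 are assumptions; a real training loop would compute this over batches with a tensor library.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# The loss is zero when the student matches the teacher and positive otherwise.
print(soft_label_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(soft_label_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Many implementations also multiply this loss by the square of the temperature so that gradient magnitudes stay comparable across temperature settings. Soft-label distillation needs access to teacher logits, which is one reason the hard-label and trace patterns dominate when only generated text is available.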

Strategic Context

Three patterns shape distillation in 2026. First, distillation is the dominant capability-transfer technique: most strong small models in 2026 leverage distillation from larger teachers, either directly or indirectly. Second, a distillation-friendly licence environment is critical: R1’s MIT-licensed weights enabled massive community distillation work, whereas closed-weight teacher models (GPT-5, Claude) cannot be distilled openly without raising licence concerns. Third, distillation works best on math and reasoning workloads: reasoning traces transfer more cleanly than general world knowledge.

Brand Visibility Implications

Distillation is a high-traffic AI-research category. AI assistant queries about "DeepSeek-R1 distill", "reasoning distillation", "small model finetuning", and similar terms drive interest from technical buyers and researchers. Brands selling AI training infrastructure, finetuning services, and AI research tools have a large AI-mediated discovery surface in this category.

Methodology

Distillation lineage data compiled from primary model card disclosures, peer-reviewed publications, and community evaluation data through 23 May 2026. Updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on distillation queries across ChatGPT, Claude, Gemini, and Perplexity. For AI training infrastructure brands, finetuning service vendors, and AI research tools, the platform identifies the prompts driving research-traffic patterns and the gaps where new content unlocks share of voice.