What is the largest open pretraining dataset?

RedPajama 2 at approximately 30 trillion tokens is the largest raw open pretraining dataset. After quality filtering, FineWeb 2 at approximately 10 trillion tokens is the largest production-quality open dataset.

What is the Common Pile?

A pretraining dataset focused on permissively-licensed text, released by Allen AI, EleutherAI, and community contributors in 2025-2026. Approximately 8 trillion tokens with explicit copyright clearance. Used by labs that prioritise IP risk minimisation in pretraining.

Is FineWeb-Edu better than FineWeb?

For most downstream model quality benchmarks, yes, despite being smaller. FineWeb-Edu (~1.3T tokens) is filtered by an educational-quality classifier and produces stronger downstream models per token than raw FineWeb (~10T tokens). The pattern shows that quality-filtered data outperforms quantity in modern pretraining.

Can I use these datasets commercially?

It depends on the dataset and the underlying licences. Common Pile, FineWeb 2 (ODC-BY), Dolma 3 (ODC-BY), and DCLM (CC-BY-4.0) are explicitly available for commercial use. RedPajama 2 has mixed underlying source licences and requires case-by-case review. Always verify the source-text licences before commercial deployment.
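A minimal sketch of how a team might encode this first-pass triage. The licence labels are copied from the dataset table on this page; the `ALLOWED_LABELS` allow-list and the function name are hypothetical, and the result only flags which datasets need a closer look at their source-text licences.

```python
from typing import Dict, List, Tuple

# Licence labels as listed in the dataset table on this page.
LICENCES: Dict[str, str] = {
    "FineWeb 2": "ODC-BY",
    "FineWeb-Edu": "ODC-BY",
    "Common Pile": "Permissive (curated)",
    "Dolma 3": "ODC-BY",
    "RedPajama 2": "Various",
    "Nemotron-CC": "Multi-licence",
    "DCLM": "CC-BY-4.0",
}

# Hypothetical allow-list: labels this sketch treats as clear for commercial use.
ALLOWED_LABELS = {"ODC-BY", "CC-BY-4.0", "Permissive (curated)"}


def split_by_review_need(licences: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (cleared, needs_review) dataset names, each sorted alphabetically."""
    cleared = sorted(name for name, label in licences.items() if label in ALLOWED_LABELS)
    needs_review = sorted(name for name, label in licences.items() if label not in ALLOWED_LABELS)
    return cleared, needs_review


cleared, needs_review = split_by_review_need(LICENCES)
print("Cleared on the dataset-level label:", cleared)
print("Needs case-by-case review:", needs_review)
```

Datasets with mixed or multiple licences fall into the review list by default, which matches the case-by-case approach described for RedPajama 2.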

How do labs choose between datasets?

Most production labs blend multiple datasets, weighted toward higher-quality subsets. A typical 2026 pretraining mix combines FineWeb-Edu, Dolma 3, The Stack v2, peS2o, math-specific data, and multilingual splits. The exact mix is the key proprietary lever; labs that disclose their data mixes (Allen AI, Hugging Face SmolLM3) are the exception.
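The blending step can be sketched as weighted sampling over sources. The weights below are hypothetical placeholders chosen only to illustrate the mechanism, not any lab's disclosed mix, and the in-memory lists stand in for streamed dataset shards.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical mixture weights (illustrative only; they sum to 1.0).
MIX_WEIGHTS: Dict[str, float] = {
    "fineweb-edu": 0.45,
    "dolma-3": 0.25,
    "the-stack-v2": 0.15,
    "peS2o": 0.05,
    "math": 0.05,
    "multilingual": 0.05,
}


def sample_mixture(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw n (source_name, document) pairs, choosing the source by weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    picks = rng.choices(names, weights=probs, k=n)
    # Within a source, documents are drawn uniformly at random (with replacement).
    return [(name, rng.choice(sources[name])) for name in picks]


# Toy stand-ins for real corpora: three placeholder documents per source.
toy_sources = {name: [f"{name} doc {i}" for i in range(3)] for name in MIX_WEIGHTS}

batch = sample_mixture(toy_sources, MIX_WEIGHTS, n=1000)
counts = {name: 0 for name in MIX_WEIGHTS}
for name, _ in batch:
    counts[name] += 1
print(counts)
```

Sampling per document keeps the example short; real pipelines more often weight by token count and sample without replacement within an epoch.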

Open Pretraining Datasets 2026

Open pretraining datasets matured into the multi-trillion-token range in 2025-2026. FineWeb 2 (Hugging Face), Common Pile (Allen AI + EleutherAI + community), Dolma 3 (Allen AI), RedPajama 2 (Together AI), Nemotron-CC (NVIDIA), DCLM (community + Apple + AI2), and Nemotron-Pretraining Dataset cover most of the openly-available pretraining text. The trend has shifted from raw Common Crawl dumps to heavily-filtered and deduplicated subsets with quality-classifier scoring. This page consolidates the landscape.

Key Findings

FineWeb 2 is the most-downloaded open pretraining dataset on Hugging Face, with approximately 10 trillion tokens of multilingual filtered web text released under the Open Data Commons Attribution licence (ODC-BY).
Common Pile from Allen AI, EleutherAI, and community contributors is the strongest fully-permissive pretraining dataset at approximately 8 trillion tokens with explicit copyright clearance.
Dolma 3 (Allen AI) is the dataset behind OLMo 3 training; approximately 6 trillion tokens with documented provenance and reproducible filtering.
RedPajama 2 from Together AI is the longest-running open pretraining dataset at approximately 30 trillion tokens of raw Common Crawl text plus filtered variants.
Nemotron-CC (NVIDIA) and Nemotron-Pretraining Dataset are NVIDIA's contributions, approximately 6.3 trillion tokens of curated commercial-grade pretraining text used in Nemotron family training.

Major Open Pretraining Datasets (May 2026)

Dataset	Size	Maintainer	License
FineWeb 2	~10T tokens	Hugging Face	ODC-BY
FineWeb-Edu	~1.3T tokens	Hugging Face	ODC-BY
Common Pile	~8T tokens	Allen AI + EleutherAI + community	Permissive (curated)
Dolma 3	~6T tokens	Allen AI	ODC-BY
RedPajama 2	~30T tokens	Together AI	Various
Nemotron-CC	~6.3T tokens	NVIDIA	Multi-licence
DCLM (DataComp-LM)	~3.8T tokens	Community + Apple + AI2	CC-BY-4.0
SlimPajama	~627B tokens	Cerebras	Apache 2.0
The Pile	~825GB	EleutherAI (legacy)	Mixed
Zyda 2	~5T tokens	Zyphra	Permissive
HPLT 2.0	Varies	HPLT consortium	CC0 (multilingual)

Specialised Pretraining Datasets

Dataset	Focus	Maintainer
The Stack v2	Code (~3T tokens, 600+ languages)	BigCode
StarCoder Training Data	Code	BigCode
Open-Math	Math reasoning	community
peS2o	Scientific papers	Allen AI
WikiSQL / WikiTQ	Tabular data	community
FineWeb-2 multilingual subsets	500+ language splits	Hugging Face
CCpdf	PDF document text	community
Smol Wikipedia	Long-form encyclopedic	Hugging Face

Quality Filtering Approaches

Approach	Description
Heuristic filtering	Repetition removal, language detection, length thresholds
Quality classifier	Trained classifier scoring educational value (FineWeb-Edu, Nemotron quality classifier)
Perplexity filtering	Score documents by LLM perplexity; remove outliers
Deduplication	MinHash, suffix array, or exact deduplication
Toxicity / safety filtering	Remove harmful content using classifiers
PII removal	Detect and redact personally identifiable information
Copyright filtering	Remove known copyrighted content (Common Pile approach)
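
Two of the approaches in the table, heuristic filtering and MinHash deduplication, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the thresholds, shingle size, and function names are hypothetical, and the pairwise comparison loop stands in for the LSH banding that large pipelines typically use to avoid comparing every pair of documents.

```python
import hashlib
import re
from typing import List, Set


def passes_heuristics(text: str, min_words: int = 20, max_repeat_ratio: float = 0.3) -> bool:
    """Length threshold plus a repeated-line check (both thresholds are illustrative)."""
    if len(text.split()) < min_words:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    repeat_ratio = 1 - len(set(lines)) / len(lines)
    return repeat_ratio <= max_repeat_ratio


def shingles(text: str, k: int = 5) -> Set[str]:
    """Set of k-word shingles over lowercased word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def filter_and_dedup(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep documents that pass heuristics and are not near-duplicates of a kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


doc_a = ("Open pretraining datasets have grown into the multi trillion token range and "
         "most of them now rely on heavy filtering and deduplication of web text")
doc_b = doc_a + " today"  # near-duplicate of doc_a
doc_c = ("Code corpora such as The Stack are usually handled separately because source files "
         "need licence aware filtering and different tokenisation choices than prose does")
doc_d = "too short"       # fails the length threshold

kept = filter_and_dedup([doc_a, doc_b, doc_c, doc_d])
print(len(kept), "documents kept")
```

The order of operations matters in practice: cheap heuristics run first so that the more expensive signature computation only touches documents that survive them.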

Strategic Context

Three patterns shape the 2026 pretraining data landscape. First, quality dominates quantity: FineWeb-Edu at approximately 1.3 trillion tokens often outperforms FineWeb 2 at approximately 10 trillion tokens on downstream model quality because of careful quality-classifier scoring. Second, copyright-cleared datasets increasingly matter: Common Pile and similar curated subsets gain adoption as labs prepare for litigation and licensing complexity. Third, multilingual coverage has expanded materially: FineWeb 2 covers 500+ languages, and HPLT 2.0 and the Common Pile multilingual subsets cover most commercially significant languages.

Brand Visibility Implications

Pretraining dataset selection is primarily a foundation-lab procurement decision, but it is increasingly relevant to fine-tuning and continued-pretraining workflows. AI assistant queries about "open pretraining dataset", "FineWeb 2", "Common Pile copyright", and similar terms drive AI research and procurement traffic. Brands selling AI training data services, AI data curation platforms, and AI licensing services have a large AI-mediated discovery surface in this category.

Methodology

Dataset figures were compiled from primary Hugging Face dataset cards, maintainer documentation, and the peer-reviewed publications associated with each dataset, as of 23 May 2026. The page is updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on pretraining dataset queries across ChatGPT, Claude, Gemini, and Perplexity. For AI training data services, AI data curation platforms, and AI licensing services, the platform identifies the prompts driving research-traffic patterns and the gaps where new content unlocks share of voice.