Open pretraining datasets matured into the multi-trillion-token range in 2025-2026. FineWeb 2 (Hugging Face), Common Pile (Allen AI + EleutherAI + community), Dolma 3 (Allen AI), RedPajama 2 (Together AI), Nemotron-CC (NVIDIA), DCLM (community + Apple + AI2), and the Nemotron-Pretraining Dataset cover most of the openly available pretraining text. The trend has shifted from raw Common Crawl dumps to heavily filtered, deduplicated subsets scored by quality classifiers. This page consolidates the landscape.
Key Findings
- FineWeb 2 (Hugging Face) is the most-downloaded open pretraining dataset on the Hugging Face Hub, with approximately 10 trillion tokens of multilingual filtered web text released under ODC-BY.
- Common Pile from Allen AI, EleutherAI, and community contributors is the strongest fully-permissive pretraining dataset at approximately 8 trillion tokens with explicit copyright clearance.
- Dolma 3 (Allen AI) is the dataset behind OLMo 3 training, at approximately 6 trillion tokens with documented provenance and reproducible filtering.
- RedPajama 2 from Together AI is the largest dataset covered here, at approximately 30 trillion tokens of raw Common Crawl text plus filtered variants.
- Nemotron-CC and the Nemotron-Pretraining Dataset are NVIDIA's contributions: approximately 6.3 trillion tokens of curated, commercial-grade pretraining text used in Nemotron family training.
Major Open Pretraining Datasets (May 2026)
| Dataset | Size | Maintainer | License |
|---|---|---|---|
| FineWeb 2 | ~10T tokens | Hugging Face | ODC-BY |
| FineWeb-Edu | ~1.3T tokens | Hugging Face | ODC-BY |
| Common Pile | ~8T tokens | Allen AI + EleutherAI + community | Permissive (curated) |
| Dolma 3 | ~6T tokens | Allen AI | ODC-BY |
| RedPajama 2 | ~30T tokens | Together AI | Various |
| Nemotron-CC | ~6.3T tokens | NVIDIA | Multi-licence |
| DCLM (DataComp-LM) | ~3.8T tokens | Community + Apple + AI2 | CC-BY-4.0 |
| SlimPajama | ~627B tokens | Cerebras | Apache 2.0 |
| The Pile | ~825 GB | EleutherAI (legacy) | Mixed |
| Zyda 2 | ~5T tokens | Zyphra | Permissive |
| HPLT 2.0 (multilingual) | Varies | HPLT consortium | CC0 |
Specialised Pretraining Datasets
| Dataset | Focus | Maintainer |
|---|---|---|
| The Stack v2 | Code (~3T tokens, 600+ languages) | BigCode |
| StarCoder Training Data | Code | BigCode |
| Open-Math | Math reasoning | community |
| peS2o | Scientific papers | Allen AI |
| WikiSQL / WikiTQ | Tabular data | community |
| FineWeb 2 multilingual subsets | 500+ language splits | Hugging Face |
| CCpdf | PDF document text | community |
| Smol Wikipedia | Long-form encyclopedic | Hugging Face |
Quality Filtering Approaches
| Approach | Description |
|---|---|
| Heuristic filtering | Repetition removal, language detection, length thresholds |
| Quality classifier | Trained classifier scoring educational value (FineWeb-Edu, Nemotron quality classifier) |
| Perplexity filtering | Score documents by LLM perplexity; remove outliers |
| Deduplication | MinHash, suffix array, or exact deduplication |
| Toxicity / safety filtering | Remove harmful content using classifiers |
| PII removal | Detect and redact personally identifiable information |
| Copyright filtering | Remove known copyrighted content (Common Pile approach) |
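Two of the approaches above, heuristic filtering and MinHash deduplication, can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the pipeline of any dataset listed on this page: the function names, the thresholds (`min_words`, `max_repeat_ratio`, `threshold`), the shingle size, and the signature length are all assumptions chosen for readability.

```python
import hashlib
import re

NUM_HASHES = 64    # signature length (assumed; real pipelines tune this)
SHINGLE_SIZE = 3   # words per shingle (assumed)


def passes_heuristics(text, min_words=5, max_repeat_ratio=0.5):
    """Heuristic filter: a length threshold plus a crude repetition check."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    # Share of word occurrences that repeat a word already seen in the document.
    repeat_ratio = 1 - len(set(words)) / len(words)
    return repeat_ratio <= max_repeat_ratio


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of overlapping k-word shingles in the text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def filter_and_dedup(docs, threshold=0.7):
    """Drop documents that fail the heuristics or near-duplicate a kept document."""
    kept, kept_signatures = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        signature = minhash_signature(doc)
        if any(estimated_jaccard(signature, other) >= threshold
               for other in kept_signatures):
            continue
        kept.append(doc)
        kept_signatures.append(signature)
    return kept
```

Calling `filter_and_dedup(list_of_texts)` returns the surviving documents in their original order. The pairwise comparison here is quadratic in the number of kept documents; at web scale, MinHash signatures are usually combined with locality-sensitive hashing so that only likely duplicates are compared.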
Strategic Context
Three patterns shape the 2026 pretraining data landscape. First, quality dominates quantity: models trained on FineWeb-Edu at 1.3 trillion tokens often outperform models trained on FineWeb 2 at 10 trillion tokens on downstream benchmarks, because of FineWeb-Edu's quality-classifier scoring. Second, copyright-cleared datasets increasingly matter: Common Pile and similar curated subsets gain adoption as labs prepare for litigation and licensing complexity. Third, multilingual coverage has expanded materially: FineWeb 2 covers 500+ languages, while HPLT 2.0 and the Common Pile multilingual subsets cover most commercially significant languages.
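The quality-over-quantity trade-off can be illustrated with a score threshold: raising the cut-off keeps fewer tokens but a higher average quality. The sketch below is a minimal illustration under stated assumptions. `ScoredDoc`, `filter_by_quality`, the score scale, and the example numbers are hypothetical and do not reproduce any published classifier or dataset statistic.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    tokens: int      # token count of the document
    quality: float   # hypothetical quality-classifier score


def filter_by_quality(docs, threshold):
    """Keep documents scoring at or above the threshold.

    Returns the kept documents and the fraction of tokens retained.
    """
    kept = [d for d in docs if d.quality >= threshold]
    total_tokens = sum(d.tokens for d in docs)
    kept_tokens = sum(d.tokens for d in kept)
    retention = kept_tokens / total_tokens if total_tokens else 0.0
    return kept, retention


# Illustrative corpus with made-up scores and token counts.
corpus = [
    ScoredDoc("textbook-style explanation", tokens=100, quality=4.2),
    ScoredDoc("navigation boilerplate", tokens=300, quality=1.0),
    ScoredDoc("detailed how-to article", tokens=600, quality=3.0),
]
kept_docs, token_retention = filter_by_quality(corpus, threshold=3.0)
```

Sweeping `threshold` over a scored sample and plotting `token_retention` against downstream evaluation results is one way to choose a cut-off empirically.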
Brand Visibility Implications
Pretraining dataset selection is a foundation-lab procurement decision, but it is increasingly relevant to fine-tuning and continued-pretraining workflows. AI assistant queries about "open pretraining dataset", "FineWeb 2", "Common Pile copyright", and similar terms drive AI research and procurement traffic. Brands selling AI training data services, AI data curation platforms, and AI licensing services therefore have a large AI-mediated discovery surface in this category.
Methodology
Dataset figures were compiled from primary Hugging Face dataset cards, maintainer documentation, and the peer-reviewed publications associated with each dataset, current through 23 May 2026. The page is updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on pretraining dataset queries across ChatGPT, Claude, Gemini, and Perplexity. For AI training data services, AI data curation platforms, and AI licensing services, the platform identifies the prompts driving research-traffic patterns and the gaps where new content unlocks share of voice.