
Open Pretraining Datasets 2026

Open-source pretraining datasets 2026: FineWeb 2 (~10T), Common Pile (~8T), Dolma 3 (~6T), RedPajama 2 (~30T), Nemotron-CC (~6.3T), DCLM. Comparison, licensing, quality filters.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Open pretraining datasets matured into the multi-trillion-token range in 2025-2026. FineWeb 2 (Hugging Face), Common Pile (Allen AI + EleutherAI + community), Dolma 3 (Allen AI), RedPajama 2 (Together AI), Nemotron-CC (NVIDIA), DCLM (community + Apple + AI2), and NVIDIA's Nemotron-Pretraining Dataset together cover most of the openly available pretraining text. The trend has shifted from raw Common Crawl dumps to heavily filtered, deduplicated subsets scored by quality classifiers. This page consolidates the landscape.

Key Findings

  1. FineWeb 2 from Hugging Face is the most-downloaded open pretraining dataset on the Hugging Face Hub, with approximately 10 trillion tokens of filtered multilingual web text released under the ODC-BY licence.
  2. Common Pile from Allen AI, EleutherAI, and community contributors is the leading fully permissive pretraining dataset, at approximately 8 trillion tokens with explicit copyright clearance.
  3. Dolma 3 (Allen AI) is the dataset behind OLMo 3 training: approximately 6 trillion tokens with documented provenance and reproducible filtering.
  4. RedPajama 2 from Together AI is the longest-running open pretraining dataset, at approximately 30 trillion tokens of raw Common Crawl text plus filtered variants.
  5. Nemotron-CC and the Nemotron-Pretraining Dataset are NVIDIA's contributions: approximately 6.3 trillion tokens of curated, commercial-grade pretraining text used to train the Nemotron model family.

Major Open Pretraining Datasets (May 2026)

Dataset | Size | Maintainer | License
FineWeb 2 | ~10T tokens | Hugging Face | ODC-BY
FineWeb-Edu | ~1.3T tokens | Hugging Face | ODC-BY
Common Pile | ~8T tokens | Allen AI + EleutherAI + community | Permissive (curated)
Dolma 3 | ~6T tokens | Allen AI | ODC-BY
RedPajama 2 | ~30T tokens | Together AI | Various
Nemotron-CC | ~6.3T tokens | NVIDIA | Multi-licence
DCLM (DataComp-LM) | ~3.8T tokens | Community + Apple + AI2 | CC-BY-4.0
SlimPajama | ~627B tokens | Cerebras | Apache 2.0
The Pile | ~825GB | EleutherAI (legacy) | Mixed
Zyda 2 | ~5T tokens | Zyphra | Permissive
HPLT 2.0 | Varies | HPLT consortium | CC-0 / Multilingual

Specialised Pretraining Datasets

Dataset | Focus | Maintainer
The Stack v2 | Code (~3T tokens, 600+ languages) | BigCode
StarCoder Training Data | Code | BigCode
Open-Math | Math reasoning | community
peS2o | Scientific papers | Allen AI
WikiSQL / WikiTQ | Tabular data | community
FineWeb-2 multilingual subsets | 500+ language splits | Hugging Face
CCpdf | PDF document text | community
Smol Wikipedia | Long-form encyclopedic | Hugging Face

Quality Filtering Approaches

Approach | Description
Heuristic filtering | Repetition removal, language detection, length thresholds
Quality classifier | Trained classifier scoring educational value (FineWeb-Edu, Nemotron quality classifier)
Perplexity filtering | Score documents by LLM perplexity; remove outliers
Deduplication | MinHash, suffix array, or exact deduplication
Toxicity / safety filtering | Remove harmful content using classifiers
PII removal | Detect and redact personally identifiable information
Copyright filtering | Remove known copyrighted content (Common Pile approach)
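
The heuristic-filtering and deduplication approaches listed above can be made concrete with a small sketch. The Python below is illustrative only and is not the pipeline used by any dataset on this page: the function names, the thresholds (`min_words`, `max_repeat_ratio`, `threshold`), and the choice of word 3-gram shingles are all assumptions. It applies a length and repeated-line check, then drops near-duplicates by comparing MinHash signatures.

```python
import hashlib
import re


def passes_heuristics(text, min_words=5, max_repeat_ratio=0.3):
    """Cheap heuristic filter: a length threshold plus a repeated-line check."""
    if len(text.split()) < min_words:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Fraction of non-blank lines that repeat an earlier line.
    repeat_ratio = 1 - len(set(lines)) / len(lines)
    return repeat_ratio <= max_repeat_ratio


def shingles(text, n=3):
    """Set of lowercased word n-grams; punctuation and case are ignored."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(text, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum shingle hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def filter_and_dedup(docs, threshold=0.7):
    """Keep documents that pass the heuristics and are not near-duplicates of a kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Comparing each signature against every kept signature is quadratic in the number of documents; pipelines at web scale typically bucket signatures with locality-sensitive hashing instead, which this sketch omits.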

Strategic Context

Three patterns shape the 2026 pretraining data landscape. First, quality dominates quantity: models trained on FineWeb-Edu (1.3 trillion tokens) often outperform those trained on FineWeb 2 (10 trillion tokens) on downstream benchmarks, because its quality classifier keeps only high-scoring documents. Second, copyright-cleared datasets increasingly matter: Common Pile and similar curated subsets gain adoption as labs prepare for litigation and licensing complexity. Third, multilingual coverage has expanded materially: FineWeb 2 covers 500+ languages, while HPLT 2.0 and the Common Pile multilingual subsets cover most commercially significant languages.
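
The quality-over-quantity trade can be illustrated with a minimal sketch of classifier-score thresholding. It assumes each document already carries a quality score from some classifier. The `ScoredDoc` type, the scores, the token counts, and the threshold are all made up for illustration and do not describe the settings of any dataset named here.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    tokens: int    # token count of the document (hypothetical values)
    score: float   # quality score from an assumed upstream classifier


def filter_by_score(docs, threshold):
    """Keep only documents whose quality score meets the threshold."""
    return [d for d in docs if d.score >= threshold]


def token_retention(docs, kept):
    """Fraction of the original token count that survives filtering."""
    total = sum(d.tokens for d in docs)
    return sum(d.tokens for d in kept) / total if total else 0.0


corpus = [
    ScoredDoc("tutorial on sorting algorithms", 1000, 4.2),
    ScoredDoc("cookie banner and navigation text", 3000, 1.1),
    ScoredDoc("encyclopedia entry", 500, 3.0),
    ScoredDoc("forum thread with little content", 5500, 2.4),
]
kept = filter_by_score(corpus, threshold=3.0)
print(f"kept {len(kept)} of {len(corpus)} docs, "
      f"{token_retention(corpus, kept):.0%} of tokens")
```

A strict threshold discards most of the tokens, which is the same shape as the FineWeb-Edu versus FineWeb 2 size gap described above: a much smaller corpus made of the highest-scoring documents.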

Brand Visibility Implications

Pretraining dataset selection is primarily a foundation-lab procurement decision, but it is increasingly relevant to fine-tuning and continued-pretraining workflows. AI assistant queries about "open pretraining dataset", "FineWeb 2", "Common Pile copyright", and similar terms drive AI research and procurement traffic. Brands selling AI training data services, AI data curation platforms, and AI licensing services have a large AI-mediated discovery surface in this category.

Methodology

Dataset figures were compiled from primary Hugging Face dataset cards, maintainer documentation, and the peer-reviewed publications associated with each dataset, through 23 May 2026. The page is updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on pretraining dataset queries across ChatGPT, Claude, Gemini, and Perplexity. For AI training data services, AI data curation platforms, and AI licensing services, the platform identifies the prompts driving research-traffic patterns and the gaps where new content unlocks share of voice.

Frequently Asked Questions

The largest raw open pretraining dataset is RedPajama 2, at approximately 30 trillion tokens. After quality filtering, FineWeb 2 at approximately 10 trillion tokens is the largest production-quality open dataset.
Common Pile is a pretraining dataset focused on permissively licensed text, released by Allen AI, EleutherAI, and community contributors in 2025-2026. It contains approximately 8 trillion tokens with explicit copyright clearance and is used by labs that prioritise IP risk minimisation in pretraining.
On most downstream model quality benchmarks, FineWeb-Edu outperforms FineWeb 2 despite being smaller. FineWeb-Edu (~1.3T tokens) is filtered by an educational-quality classifier and produces stronger downstream models per token than FineWeb 2 (~10T tokens). The pattern shows that quality-filtered data outperforms sheer quantity in modern pretraining.
Whether an open pretraining dataset can be used commercially depends on the dataset and the underlying licences. Common Pile, FineWeb 2 (ODC-BY), Dolma 3 (ODC-BY), and DCLM (CC-BY-4.0) are explicitly available for commercial use. RedPajama 2 has mixed underlying source licences and requires case-by-case review. Always verify the source-text licences before commercial deployment.
Most production labs blend multiple datasets, weighted toward higher-quality subsets. A typical 2026 pretraining mix combines FineWeb-Edu, Dolma 3, The Stack v2, peS2o, math-specific data, and multilingual splits. The exact mix is the key proprietary lever; labs that disclose their data mixes (Allen AI, Hugging Face with SmolLM3) are the exception.
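
The dataset blending described in the last answer can be sketched as weighted sampling over sources. The weights below are hypothetical placeholders, not a disclosed or recommended mix, and the source keys and function names are assumptions made for illustration.

```python
import random

# Hypothetical mixing weights; real mixes are mostly undisclosed.
MIX_WEIGHTS = {
    "fineweb-edu": 0.45,
    "dolma-3": 0.25,
    "the-stack-v2": 0.15,
    "pes2o": 0.05,
    "math": 0.05,
    "multilingual": 0.05,
}


def normalise(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def token_budget(weights, total_tokens):
    """Split a total token budget across sources in proportion to their weights."""
    return {name: round(total_tokens * p) for name, p in normalise(weights).items()}


def sample_sources(weights, n, seed=0):
    """Draw the source of each of n training documents according to the mix."""
    rng = random.Random(seed)
    probs = normalise(weights)
    names = list(probs)
    return rng.choices(names, weights=[probs[k] for k in names], k=n)


if __name__ == "__main__":
    print(token_budget(MIX_WEIGHTS, 1_000_000_000_000))
    print(sample_sources(MIX_WEIGHTS, 10))
```

Keeping the weights in one mapping makes the mix easy to vary between runs, which matters because the mix is the main lever a lab tunes.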
