Training Data Curation: Web Filtering, Deduplication, and Quality Selection
Common Crawl contains 400B+ tokens of raw web text per snapshot; quality filtering (URL filtering, perplexity scoring, deduplication) retains ~5–20% as training data; Penedo et al. (2024) showed with FineWeb that quality filtering improves benchmark scores by 2–4 points.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Common Crawl raw size | 400+ | billion tokens | Monthly snapshots; 2021 snapshot ≈ 3.1TB compressed; quality varies widely |
| Retention rate after quality filtering | 5–20% | % | Typical filtering pipeline retains 5–20% of raw Common Crawl; varies by pipeline |
| Deduplication improvement | ~1.5× | perplexity improvement | Lee et al. (2022): removing duplicates reduces perplexity ~1.5× at same training compute |
| Near-deduplication threshold | 0.8 | MinHash Jaccard similarity | Typical threshold for near-duplicate detection using MinHash LSH |
| Code data impact | +10–15% | % on reasoning benchmarks | Chen et al. (2021): including code in pre-training improves mathematical reasoning |
Training data quality is at least as important as model architecture for language model performance. Raw web text contains spam, templated content, low-information pages, and near-duplicate documents. Systematic curation pipelines convert hundreds of terabytes of raw text into training corpora that enable effective language model pre-training.
Filtering Pipeline Stages
| Stage | Method | Typical Reduction |
|---|---|---|
| URL filtering | Blocklist of spam/adult domains | 10–30% |
| Language identification | fastText classifier | 30–60% (for English-only) |
| Length & content heuristics | Min/max document length, symbol ratios | 5–15% |
| Quality scoring | Perplexity vs reference LM; content classifier | 30–70% |
| Near-deduplication | MinHash LSH (Jaccard ≥ 0.8) | 20–40% |
| Exact deduplication | Hash-based | 5–10% |
| Safety/PII filtering | Rule-based + classifier | 2–5% |
Combined pipeline: retains ~5–20% of raw Common Crawl as high-quality training data, i.e., roughly 20–80B tokens from a 400B-token snapshot.
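The staged pipeline above can be sketched as a chain of cheap heuristic filters applied in order of cost. The function names and thresholds below are illustrative, not taken from any specific production pipeline:

```python
def length_ok(doc, min_words=50, max_words=100_000):
    """Heuristic: drop very short or very long documents."""
    n = len(doc.split())
    return min_words <= n <= max_words

def symbol_ratio_ok(doc, max_symbol_ratio=0.1):
    """Heuristic: drop documents dominated by non-alphanumeric symbols."""
    if not doc:
        return False
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / len(doc) <= max_symbol_ratio

def filter_corpus(docs, stages=(length_ok, symbol_ratio_ok)):
    """Run documents through filter stages in sequence.

    Cheap checks go first so expensive stages (quality scoring,
    deduplication) see as few documents as possible.
    """
    kept = list(docs)
    for stage in stages:
        kept = [d for d in kept if stage(d)]
    return kept
```

Real pipelines add language identification, perplexity scoring, and deduplication as further stages in the same chain.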
Data Source Composition for Large Models
| Source | Quality | Scale | Common Use |
|---|---|---|---|
| Common Crawl (filtered) | Variable → High after filter | 400B+ tokens/snapshot | Primary pre-training data |
| Wikipedia | High | ~3B tokens | Factual grounding |
| Books corpora | High | 12–100B tokens | Long-form structure |
| GitHub/code | High for code | 100B+ tokens | Reasoning improvement |
| Scientific papers | High | 50B+ tokens | STEM reasoning |
| Web text (curated) | High | 20–50B tokens | Instruction quality |
Deduplication Methods
| Method | Type | Granularity | Complexity |
|---|---|---|---|
| Exact match | Hash (SHA256) | Document, paragraph | O(n) |
| MinHash LSH | Approximate | Document | O(n log n) |
| SimHash | Approximate | Document | O(n) |
| Suffix array | Exact | n-gram | O(n log n), high memory |
Lee et al. (2022) found suffix array-based substring deduplication to be the most thorough method: removing repeated sequences of ≥50 tokens reduced memorization most effectively and improved held-out perplexity ~1.5× at the same training compute.
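A minimal from-scratch MinHash sketch of the near-duplicate detection in the table above, using character 5-gram shingles and 64 seeded hash functions. The LSH banding step that makes candidate lookup sub-quadratic is omitted, and production pipelines typically use a library (e.g., datasketch) rather than this toy:

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams; word-level shingles are also common."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64, k=5):
    """One minimum per seeded hash function approximates a random
    permutation of the shingle universe."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text, k)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds the chosen threshold (0.8 in the table above) are treated as near-duplicates and all but one copy is dropped.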
Impact on Benchmark Performance
Penedo et al. (2024) systematically compared filtering strategies on Common Crawl, finding:
- High-quality filtered data (FineWeb) outperforms unfiltered CC by 2–4 points on MMLU
- Mixing filtered web with curated sources (books, Wikipedia) consistently improves over web-only
- Raising training token count with low-quality data can hurt performance relative to fewer high-quality tokens
Related Pages
See pre-training for how curated data is used in the training loop, and scaling-laws for how dataset quality interacts with compute-optimal token count decisions.
Sources
- Rae et al. (2021) — Scaling Language Models: Methods, Analysis and Insights from Training Gopher. arXiv
- Penedo et al. (2024) — The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv
- Lee et al. (2022) — Deduplicating Training Data Makes Language Models Better. ACL 2022
Frequently Asked Questions
What steps are in a typical web data filtering pipeline?
A typical pipeline includes: (1) URL-level filtering — blocklist of spam, adult content, and low-quality domains; (2) language identification — removing non-target language text; (3) quality filtering — perplexity scoring against a reference LM, text length filtering, symbol/punctuation ratio filters; (4) near-deduplication — MinHash LSH to remove near-duplicate documents; (5) content filtering — removing PII, harmful content. Each stage further reduces data volume while improving quality.
Why does deduplication improve language model training?
Lee et al. (2022) showed that training on deduplicated data significantly improves model quality at the same compute budget. The key mechanism: memorization. Models trained on highly duplicated data (e.g., the same document 100× in Common Crawl) memorize specific text verbatim rather than learning generalizable patterns. Deduplication forces the model to generalize rather than memorize, improving held-out perplexity by approximately 1.5× at identical compute.
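Exact hash-based deduplication, the simplest defense against the "same document 100×" case, can be sketched as follows. The normalization choices (whitespace collapse, lowercasing) are an assumption; pipelines differ on how aggressively they normalize before hashing:

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each document, matching on the
    SHA-256 digest of lightly normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

This catches only verbatim (post-normalization) copies; near-duplicates with small edits require MinHash-style approximate matching, and repeated substrings inside otherwise distinct documents require suffix-array methods.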
How is data quality measured without human labeling?
The most common automated quality signals: (1) reference model perplexity — filter out text to which a small reference LM (trained on high-quality sources) assigns high perplexity, i.e., text unlike those sources; (2) content classification — train a binary classifier on known-good vs known-bad examples; (3) linguistic features — sentence count, token-to-word ratio, average word length, punctuation density; (4) URL quality scores — domain-level reputation from human-curated allow/blocklists. These signals are noisy individually but combine to significantly improve corpus quality.
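The linguistic-feature signals in (3) can be computed cheaply per document. This is a minimal sketch; the thresholds are illustrative, and real pipelines tune them against labeled samples:

```python
def quality_features(doc):
    """Compute cheap linguistic quality signals for one document."""
    words = doc.split()
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "punct_density": sum(doc.count(c) for c in ".,;:!?") / max(len(doc), 1),
    }

def passes_quality(doc, min_words=50, word_len=(3.0, 10.0), max_punct=0.1):
    """Combine noisy feature signals with illustrative thresholds."""
    f = quality_features(doc)
    return (f["n_words"] >= min_words
            and word_len[0] <= f["avg_word_len"] <= word_len[1]
            and f["punct_density"] <= max_punct)
```

In practice these features are combined with classifier scores and reference-model perplexity rather than used alone.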