Training Data Curation: Web Filtering, Deduplication, and Quality Selection
Common Crawl contains 400B+ tokens of raw web text per snapshot; quality filtering (URL filtering, perplexity scoring, deduplication) retains ~5–20% as training data; Penedo et al. (2024) showed with FineWeb that quality filtering improves benchmark scores by 2–4 points.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Common Crawl raw size | 400+ | billion tokens | Monthly snapshots; 2021 snapshot ≈ 3.1TB compressed; quality varies widely |
| Retention rate after quality filtering | 5–20% | % | Typical filtering pipeline retains 5–20% of raw Common Crawl; varies by pipeline |
| Deduplication improvement | ~1.5× | perplexity improvement | Lee et al. (2022): removing duplicates reduces perplexity ~1.5× at same training compute |
| Near-deduplication threshold | 0.8 | MinHash Jaccard similarity | Typical threshold for near-duplicate detection using MinHash LSH |
| Code data impact | +10–15% | % on reasoning benchmarks | Chen et al. (2021): including code in pre-training improves mathematical reasoning |
Training data quality is at least as important as model architecture for language model performance. Raw web text contains spam, templated content, low-information pages, and near-duplicate documents. Systematic curation pipelines convert hundreds of terabytes of raw text into training corpora that enable effective language model pre-training.
Filtering Pipeline Stages
| Stage | Method | Typical Reduction |
|---|---|---|
| URL filtering | Blocklist of spam/adult domains | 10–30% |
| Language identification | fastText classifier | 30–60% (for English-only) |
| Length & content heuristics | Min/max document length, symbol ratios | 5–15% |
| Quality scoring | Perplexity vs reference LM; content classifier | 30–70% |
| Near-deduplication | MinHash LSH (Jaccard ≥ 0.8) | 20–40% |
| Exact deduplication | Hash-based | 5–10% |
| Safety/PII filtering | Rule-based + classifier | 2–5% |
Combined pipeline: retains ~5–20% of raw Common Crawl as high-quality training data, i.e., roughly 20–80B tokens from a 400B-token snapshot.
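The staged pipeline above can be sketched as a chain of cheap heuristic filters applied in order of cost. The function names and thresholds below are illustrative, not taken from any specific production pipeline:

```python
def length_ok(doc, min_words=50, max_words=100_000):
    """Heuristic: drop very short or very long documents."""
    n = len(doc.split())
    return min_words <= n <= max_words

def symbol_ratio_ok(doc, max_symbol_ratio=0.1):
    """Heuristic: drop documents dominated by non-alphanumeric symbols."""
    if not doc:
        return False
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / len(doc) <= max_symbol_ratio

def filter_corpus(docs, stages=(length_ok, symbol_ratio_ok)):
    """Run documents through filter stages in sequence.

    Cheap checks go first so expensive stages (quality scoring,
    deduplication) see as few documents as possible.
    """
    kept = list(docs)
    for stage in stages:
        kept = [d for d in kept if stage(d)]
    return kept
```

Real pipelines add language identification, perplexity scoring, and deduplication as further stages in the same chain.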
Data Source Composition for Large Models
| Source | Quality | Scale | Common Use |
|---|---|---|---|
| Common Crawl (filtered) | Variable → High after filter | 400B+ tokens/snapshot | Primary pre-training data |
| Wikipedia | High | ~3B tokens | Factual grounding |
| Books corpora | High | 12–100B tokens | Long-form structure |
| GitHub/code | High for code | 100B+ tokens | Reasoning improvement |
| Scientific papers | High | 50B+ tokens | STEM reasoning |
| Web text (curated) | High | 20–50B tokens | Instruction quality |
Deduplication Methods
| Method | Type | Granularity | Complexity |
|---|---|---|---|
| Exact match | Hash (SHA256) | Document, paragraph | O(n) |
| MinHash LSH | Approximate | Document | O(n log n) |
| SimHash | Approximate | Document | O(n) |
| Suffix array | Exact | n-gram | O(n log n), high memory |
Lee et al. (2022) found suffix array-based substring deduplication to be the most thorough method: removing repeated sequences of ≥50 tokens reduced memorization most effectively and improved held-out perplexity ~1.5× at the same training compute.
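A minimal from-scratch MinHash sketch of the near-duplicate detection in the table above, using character 5-gram shingles and 64 seeded hash functions. The LSH banding step that makes candidate lookup sub-quadratic is omitted, and production pipelines typically use a library (e.g., datasketch) rather than this toy:

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams; word-level shingles are also common."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64, k=5):
    """One minimum per seeded hash function approximates a random
    permutation of the shingle universe."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text, k)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds the chosen threshold (0.8 in the table above) are treated as near-duplicates and all but one copy is dropped.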
Impact on Benchmark Performance
Penedo et al. (2024) systematically compared filtering strategies on Common Crawl, finding:
- High-quality filtered data (FineWeb) outperforms unfiltered CC by 2–4 points on MMLU
- Mixing filtered web with curated sources (books, Wikipedia) consistently improves over web-only
- Raising training token count with low-quality data can hurt performance relative to fewer high-quality tokens
Related Pages
See pre-training for how curated data is used in the training loop, and scaling-laws for how dataset quality interacts with compute-optimal token count decisions.
Sources
- Rae et al. (2021) — Scaling Language Models: Methods, Analysis and Insights from Training Gopher. arXiv
- Penedo et al. (2024) — The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv
- Lee et al. (2022) — Deduplicating Training Data Makes Language Models Better. ACL 2022
Frequently Asked Questions
What steps are in a typical web data filtering pipeline?
A typical pipeline includes: (1) URL-level filtering — blocklist of spam, adult content, and low-quality domains; (2) language identification — removing non-target language text; (3) quality filtering — perplexity scoring against a reference LM, text length filtering, symbol/punctuation ratio filters; (4) near-deduplication — MinHash LSH to remove near-duplicate documents; (5) content filtering — removing PII, harmful content. Each stage further reduces data volume while improving quality.
Why does deduplication improve language model training?
Lee et al. (2022) showed that training on deduplicated data significantly improves model quality at the same compute budget. The key mechanism: memorization. Models trained on highly duplicated data (e.g., the same document 100× in Common Crawl) memorize specific text verbatim rather than learning generalizable patterns. Deduplication forces the model to generalize rather than memorize, improving held-out perplexity by approximately 1.5× at identical compute.
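Exact hash-based deduplication, the simplest defense against the "same document 100×" case, can be sketched as follows. The normalization choices (whitespace collapse, lowercasing) are an assumption; pipelines differ on how aggressively they normalize before hashing:

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each document, matching on the
    SHA-256 digest of lightly normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

This catches only verbatim (post-normalization) copies; near-duplicates with small edits require MinHash-style approximate matching, and repeated substrings inside otherwise distinct documents require suffix-array methods.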
How is data quality measured without human labeling?
The most common automated quality signals: (1) reference model perplexity — filter out text to which a small reference LM (trained on high-quality sources) assigns high perplexity, i.e., text unlike those sources; (2) content classification — train a binary classifier on known-good vs known-bad examples; (3) linguistic features — sentence count, token-to-word ratio, average word length, punctuation density; (4) URL quality scores — domain-level reputation from human-curated allow/blocklists. These signals are noisy individually but combine to significantly improve corpus quality.
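The linguistic-feature signals in (3) can be computed cheaply per document. This is a minimal sketch; the thresholds are illustrative, and real pipelines tune them against labeled samples:

```python
def quality_features(doc):
    """Compute cheap linguistic quality signals for one document."""
    words = doc.split()
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "punct_density": sum(doc.count(c) for c in ".,;:!?") / max(len(doc), 1),
    }

def passes_quality(doc, min_words=50, word_len=(3.0, 10.0), max_punct=0.1):
    """Combine noisy feature signals with illustrative thresholds."""
    f = quality_features(doc)
    return (f["n_words"] >= min_words
            and word_len[0] <= f["avg_word_len"] <= word_len[1]
            and f["punct_density"] <= max_punct)
```

In practice these features are combined with classifier scores and reference-model perplexity rather than used alone.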