Pre-Training: Self-Supervised Learning on Large Text Corpora
Pre-training on large corpora with self-supervised objectives (causal LM or MLM) produces general representations; GPT-3 was pre-trained on 300B tokens at 175B parameters using ~3.14×10²³ FLOPs (Brown et al., 2020).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 pre-training tokens | 300 | billion tokens | Brown et al. (2020); trained on mix of Common Crawl, WebText2, Books, Wikipedia |
| GPT-3 parameters | 175 | billion | 96 layers, d_model=12288, 96 attention heads; dense decoder-only transformer |
| GPT-3 pre-training FLOPs | 3.14 × 10²³ | FLOPs | Estimated as 6·N·D where N=175B, D=300B (Kaplan scaling law formula) |
| BERT pre-training tokens | ~13.7 | billion tokens | Corpus of ~3.3B words (2.5B Wikipedia + 800M BooksCorpus), seen over multiple epochs; much smaller than GPT-3 |
| GPT-3 data sampling weights | 60 / 22 / 8 / 8 / 3 | % | Common Crawl / WebText2 / Books1 / Books2 / Wikipedia; sampling proportions, not dataset sizes |
Pre-training is the first phase of the two-stage (pre-train, fine-tune) paradigm that defines modern language model development. During pre-training, a transformer is trained on large unlabeled text corpora using self-supervised objectives — tasks where the labels are derived from the text itself, requiring no human annotation.
Pre-Training Objectives
| Objective | Architecture | Training Signal | Example Models |
|---|---|---|---|
| Causal Language Modeling (CLM) | Decoder-only | Predict next token from left context | GPT family |
| Masked Language Modeling (MLM) | Encoder-only | Predict masked tokens from both directions | BERT |
| Prefix Language Modeling | Encoder-decoder | Predict continuation given prefix | T5, FLAN |
| Denoising (corrupted spans) | Encoder-decoder | Reconstruct corrupted spans | T5 |
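The two most common objectives differ mainly in how targets are derived from raw tokens. A minimal sketch in plain Python (illustrative token IDs; the `MASK_ID` value and the 15% masking rate, taken from BERT, are assumptions of this sketch):

```python
import random

MASK_ID = 0  # hypothetical [MASK] token id for this sketch

def clm_example(tokens):
    """Causal LM: input is tokens[:-1], target is the sequence shifted left by one."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Masked LM: replace ~15% of tokens with MASK_ID; the target at each
    masked position is the original token (None elsewhere, i.e. no loss)."""
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            targets.append(t)       # predict the original token
        else:
            inputs.append(t)
            targets.append(None)    # unmasked positions contribute no loss
    return inputs, targets

tokens = [5, 17, 42, 8, 23, 99]
x, y = clm_example(tokens)
print(x, y)  # [5, 17, 42, 8, 23] [17, 42, 8, 23, 99]
```

Both objectives manufacture labels from the text itself, which is what makes them self-supervised: no human annotation is needed.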
Pre-Training Data Composition (GPT-3)
| Source | Tokens (filtered) | Sampling weight |
|---|---|---|
| Common Crawl (filtered) | ~410B | 60% |
| WebText2 | ~19B | 22% |
| Books1 | ~12B | 8% |
| Books2 | ~55B | 8% |
| Wikipedia | ~3B | 3% |
| Total seen in training | ~300B | — |

The filtered datasets sum to ~499B tokens; the weights are sampling proportions per batch, so over the 300B training tokens the smaller high-weight sources (e.g. Wikipedia) are seen several times while Common Crawl is seen less than once.
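Mixture sampling of this kind can be sketched with `random.choices`, which draws items in proportion to their weights. A toy illustration using the weights from the table above (the draw count and seed are arbitrary choices for this sketch):

```python
import random
from collections import Counter

# Sampling weights from the GPT-3 data-mixture table (Brown et al., 2020)
weights = {"CommonCrawl": 60, "WebText2": 22, "Books1": 8, "Books2": 8, "Wikipedia": 3}

rng = random.Random(42)
draws = rng.choices(list(weights), weights=weights.values(), k=100_000)
counts = Counter(draws)

for source, w in weights.items():
    frac = 100 * counts[source] / len(draws)
    print(f"{source:12s} target {w:>2d}%  sampled {frac:.1f}%")
```

In a real training pipeline the unit drawn would be a document or a packed sequence from the chosen source rather than just its name, but the proportioning mechanism is the same.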
Training Configuration Evolution
| Model | Year | Parameters | Tokens | FLOPs (≈6·N·D) |
|---|---|---|---|---|
| GPT | 2018 | 117M | 5B | ~3.5 × 10¹⁸ |
| BERT-large | 2019 | 340M | 13.7B | ~2.8 × 10¹⁹ |
| GPT-2 | 2019 | 1.5B | 40B | ~3.6 × 10²⁰ |
| GPT-3 | 2020 | 175B | 300B | ~3.15 × 10²³ |
The 6·N·D Rule
A widely used estimate for training FLOPs:
C ≈ 6 · N · D
where N = number of parameters and D = number of training tokens. The factor 6 accounts for the forward pass (≈2 FLOPs per parameter per token) and the backward pass (≈4 FLOPs per parameter per token, covering gradients with respect to both activations and weights). Kaplan et al. (2020) validated this formula across many scales.
For GPT-3: 6 × 175B × 300B = 3.15 × 10²³ FLOPs — consistent with the reported estimate.
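The arithmetic can be checked directly with a one-line helper (parameter and token counts from the tables above):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6·N·D (Kaplan et al., 2020)."""
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters, 300B tokens
c = train_flops(175e9, 300e9)
print(f"{c:.2e}")  # 3.15e+23
```

The same helper reproduces the FLOPs column of the configuration-evolution table when fed each model's N and D.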
Related Pages
See scaling-laws for how pre-training performance scales with N, D, and C; training-data-curation for how raw web text is filtered into usable training data; and next-token-prediction for the causal LM objective in detail.
Sources
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020
- Devlin et al. (2019) — BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019
- Radford et al. (2018) — Improving Language Understanding by Generative Pre-Training (GPT). OpenAI Blog
Frequently Asked Questions
Why is pre-training effective for downstream tasks?
Pre-training on diverse text exposes the model to a vast range of linguistic patterns, factual associations, and reasoning chains. The model develops representations that capture syntax, semantics, and world knowledge. When fine-tuned on a downstream task, only a small labeled dataset is needed to specialize these general representations, rather than learning from scratch. This is the 'pre-train then fine-tune' paradigm that has defined NLP since 2018.
What data is used for pre-training?
Large language models are pre-trained on filtered web text (Common Crawl derivatives), books corpora (Books1, Books2), Wikipedia, code repositories (GitHub), scientific papers, and other high-quality text sources. Brown et al. (2020) found that data quality matters significantly — the mix of data sources and filtering applied to Common Crawl substantially affects downstream performance. Typical pre-training corpora for large models contain 1–15 trillion tokens.
How many training steps does pre-training require?
Pre-training step count ≈ D / B, where D = total training tokens and B = batch size in tokens (equivalently D / (b × L) for b sequences of length L per batch). GPT-3, with 300B tokens at a 3.2M-token batch size: 300B / 3.2M ≈ 93,750 steps. For comparison, BERT used ~1M steps at much smaller batch sizes. Modern large-scale pre-training typically runs 250K–1M optimizer steps, with each step processing millions of tokens.
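The step-count arithmetic above, as a quick check (GPT-3 figures; since the batch size is already given in tokens, no separate sequence-length factor is needed):

```python
def num_steps(total_tokens: float, batch_tokens: float) -> int:
    """Optimizer steps = total training tokens / tokens per batch."""
    return round(total_tokens / batch_tokens)

print(num_steps(300e9, 3.2e6))  # 93750
```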