Pre-Training: Self-Supervised Learning on Large Text Corpora

Category: training Updated: 2026-02-27

Pre-training on large corpora with self-supervised objectives (causal LM or MLM) produces general-purpose representations; GPT-3 (175B parameters) was pre-trained on 300B tokens using ~3.14 × 10²³ FLOPs (Brown et al., 2020).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 pre-training tokens | 300 | billion tokens | Brown et al. (2020); mix of Common Crawl, WebText2, Books1/Books2, Wikipedia |
| GPT-3 parameters | 175 | billion | 96 layers, d_model = 12288, 96 attention heads; dense decoder-only transformer |
| GPT-3 pre-training FLOPs | 3.14 × 10²³ | FLOPs | Estimated as 6·N·D with N = 175B, D = 300B (Kaplan et al., 2020) |
| BERT pre-training tokens | ~132 | billion tokens | 3.3B-word corpus (2.5B Wikipedia + 0.8B BooksCorpus) × ~40 epochs; ~1,000× less compute than GPT-3 |
| Typical pre-training data composition | ~82% web, 16% books, 3% Wikipedia | — | GPT-3 sampling weights: 60% Common Crawl, 22% WebText2, 2 × 8% books, 3% Wikipedia |

Pre-training is the first phase of the two-stage (pre-train, fine-tune) paradigm that defines modern language model development. During pre-training, a transformer is trained on large unlabeled text corpora using self-supervised objectives — tasks where the labels are derived from the text itself, requiring no human annotation.

Pre-Training Objectives

| Objective | Architecture | Training Signal | Example Models |
|---|---|---|---|
| Causal Language Modeling (CLM) | Decoder-only | Predict the next token from left context | GPT family |
| Masked Language Modeling (MLM) | Encoder-only | Predict masked tokens using context from both directions | BERT |
| Prefix Language Modeling | Decoder with bidirectional attention over the prefix | Predict the continuation given a fully visible prefix | UniLM |
| Denoising (corrupted spans) | Encoder-decoder | Reconstruct corrupted or dropped-out spans | T5 |
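The two most common objectives can be illustrated with a toy sketch (function names and the `[MASK]` handling are illustrative, not from any particular library):

```python
import random

def clm_examples(tokens):
    """Causal LM: each position is predicted from its left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: corrupt ~15% of tokens with [MASK]; the targets are the
    originals, recoverable from context on both sides of each mask."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(clm_examples(tokens)[2])  # (['the', 'cat', 'sat'], 'on')
```

Both derive labels from the raw text itself, which is what makes the objectives self-supervised: no annotation step is needed before training.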

Pre-Training Data Composition (GPT-3)

| Source | Dataset size (tokens) | Sampling weight |
|---|---|---|
| Common Crawl (filtered) | ~410B | 60% |
| WebText2 | ~19B | 22% |
| Books1 | ~12B | 8% |
| Books2 | ~55B | 8% |
| Wikipedia | ~3B | 3% |
| Sampled during training | ~300B | 100% |

Note that the weights deliberately do not track dataset size: the ~300B training tokens are drawn according to the weights, so small high-quality sources (Wikipedia, Books1) are seen for multiple epochs while Common Crawl is subsampled (Brown et al., 2020).
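Applied per example, the mixture weights amount to weighted sampling over sources; a minimal sketch (names are illustrative):

```python
import random

# GPT-3 sampling weights (Brown et al., 2020). They do not track dataset
# size, so Wikipedia (~3B tokens) repeats many times over 300B training
# tokens while Common Crawl (~410B tokens) is subsampled.
MIX = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def sample_sources(n, seed=0):
    """Pick a source for each of n training examples per the mixture weights."""
    rng = random.Random(seed)
    sources, weights = zip(*MIX.items())
    # random.choices normalizes the weights, so they need not sum to 1 exactly
    return rng.choices(sources, weights=weights, k=n)

draws = sample_sources(100_000)
counts = {s: draws.count(s) for s in MIX}
```

Over 100,000 draws the empirical fractions closely track the table's weights; real training pipelines do the same thing at the level of shards or documents rather than one label at a time.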

Training Configuration Evolution

| Model | Year | Parameters | Tokens | FLOPs (≈ 6·N·D) |
|---|---|---|---|---|
| GPT | 2018 | 117M | 5B | ~3.5 × 10¹⁸ |
| BERT-large | 2019 | 340M | ~132B | ~2.7 × 10²⁰ |
| GPT-2 | 2019 | 1.5B | 40B | ~3.6 × 10²⁰ |
| GPT-3 | 2020 | 175B | 300B | ~3.15 × 10²³ |

The 6·N·D Rule

A widely-used estimate for training FLOPs:

C ≈ 6 · N · D

where N = number of parameters and D = number of training tokens. The factor 6 counts roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass, which computes gradients with respect to both activations and weights. Kaplan et al. (2020) used this approximation, and it holds to within small constant factors across model scales.

For GPT-3: 6 × 175B × 300B = 3.15 × 10²³ FLOPs — consistent with the reported estimate.
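The arithmetic is easy to reproduce (a toy helper, not from any library):

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute via C ≈ 6·N·D (Kaplan et al., 2020)."""
    return 6 * n_params * n_tokens

# GPT-3: N = 175B parameters, D = 300B tokens
print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```

The same helper gives ballpark figures for any model once N and D are known; it ignores attention FLOPs and other overheads, which is why it is only an estimate.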

See scaling-laws for how pre-training performance scales with N, D, and C, training-data-curation for how raw web text is filtered into usable training data, and next-token-prediction for the causal LM objective in detail.


Frequently Asked Questions

Why is pre-training effective for downstream tasks?

Pre-training on diverse text exposes the model to a vast range of linguistic patterns, factual associations, and reasoning chains. The model develops representations that capture syntax, semantics, and world knowledge. When fine-tuned on a downstream task, only a small labeled dataset is needed to specialize these general representations, rather than learning from scratch. This is the 'pre-train then fine-tune' paradigm that has defined NLP since 2018.

What data is used for pre-training?

Large language models are pre-trained on filtered web text (Common Crawl derivatives), books corpora (Books1, Books2), Wikipedia, code repositories (GitHub), scientific papers, and other high-quality text sources. Brown et al. (2020) found that data quality matters significantly — the mix of data sources and filtering applied to Common Crawl substantially affects downstream performance. Typical pre-training corpora for large models contain 1–15 trillion tokens.

How many training steps does pre-training require?

Pre-training step count ≈ D / B, where D = total training tokens and B = batch size in tokens (equivalently D / (b · L) for b sequences of length L per batch). GPT-3, with 300B tokens and a 3.2M-token batch: 300B / 3.2M ≈ 93,750 steps. For comparison, BERT used ~1M steps at much smaller batch sizes. Modern large-scale pre-training typically runs 250K–1M optimizer steps, with each step processing millions of tokens.
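The step arithmetic as a quick sketch (function name is ours):

```python
def optimizer_steps(total_tokens, batch_tokens):
    """Optimizer steps needed to consume total_tokens at batch_tokens per step."""
    return total_tokens // batch_tokens

# GPT-3: 300B training tokens at a 3.2M-token batch
print(optimizer_steps(300_000_000_000, 3_200_000))  # 93750
```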
