Pre-Training: Self-Supervised Learning on Large Text Corpora
Pre-training on large corpora with self-supervised objectives (causal LM or MLM) produces general representations; GPT-3 was pre-trained on 300B tokens at 175B parameters using ~3.14×10²³ FLOPs (Brown et al., 2020).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 pre-training tokens | 300 | billion tokens | Brown et al. (2020); trained on mix of Common Crawl, WebText2, Books, Wikipedia |
| GPT-3 parameters | 175 | billion | 96 layers, d_model=12288, 96 attention heads; dense decoder-only transformer |
| GPT-3 pre-training FLOPs | 3.14 × 10²³ | FLOPs | Estimated as 6·N·D where N=175B, D=300B (Kaplan scaling law formula) |
| BERT pre-training tokens | ~13.7 | billion tokens | Corpus of ~3.3B words (2.5B Wikipedia + 800M BooksCorpus), seen over multiple epochs; much smaller than GPT-3 |
| GPT-3 data sampling weights | 60 / 22 / 8 / 8 / 3 | % | Common Crawl / WebText2 / Books1 / Books2 / Wikipedia; sampling proportions, not dataset sizes |
Pre-training is the first phase of the two-stage (pre-train, fine-tune) paradigm that defines modern language model development. During pre-training, a transformer is trained on large unlabeled text corpora using self-supervised objectives — tasks where the labels are derived from the text itself, requiring no human annotation.
Pre-Training Objectives
| Objective | Architecture | Training Signal | Example Models |
|---|---|---|---|
| Causal Language Modeling (CLM) | Decoder-only | Predict next token from left context | GPT family |
| Masked Language Modeling (MLM) | Encoder-only | Predict masked tokens from both directions | BERT |
| Prefix Language Modeling | Encoder-decoder | Predict continuation given prefix | T5, FLAN |
| Denoising (corrupted spans) | Encoder-decoder | Reconstruct corrupted spans | T5 |
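The two most common objectives differ mainly in how targets are derived from raw tokens. A minimal sketch in plain Python (illustrative token IDs; the `MASK_ID` value and the 15% masking rate, taken from BERT, are assumptions of this sketch):

```python
import random

MASK_ID = 0  # hypothetical [MASK] token id for this sketch

def clm_example(tokens):
    """Causal LM: input is tokens[:-1], target is the sequence shifted left by one."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Masked LM: replace ~15% of tokens with MASK_ID; the target at each
    masked position is the original token (None elsewhere, i.e. no loss)."""
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            targets.append(t)       # predict the original token
        else:
            inputs.append(t)
            targets.append(None)    # unmasked positions contribute no loss
    return inputs, targets

tokens = [5, 17, 42, 8, 23, 99]
x, y = clm_example(tokens)
print(x, y)  # [5, 17, 42, 8, 23] [17, 42, 8, 23, 99]
```

Both objectives manufacture labels from the text itself, which is what makes them self-supervised: no human annotation is needed.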
Pre-Training Data Composition (GPT-3)
| Source | Tokens (filtered) | Sampling weight |
|---|---|---|
| Common Crawl (filtered) | ~410B | 60% |
| WebText2 | ~19B | 22% |
| Books1 | ~12B | 8% |
| Books2 | ~55B | 8% |
| Wikipedia | ~3B | 3% |
| Total seen in training | ~300B | — |

The filtered datasets sum to ~499B tokens; the weights are sampling proportions per batch, so over the 300B training tokens the smaller high-weight sources (e.g. Wikipedia) are seen several times while Common Crawl is seen less than once.
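Mixture sampling of this kind can be sketched with `random.choices`, which draws items in proportion to their weights. A toy illustration using the weights from the table above (the draw count and seed are arbitrary choices for this sketch):

```python
import random
from collections import Counter

# Sampling weights from the GPT-3 data-mixture table (Brown et al., 2020)
weights = {"CommonCrawl": 60, "WebText2": 22, "Books1": 8, "Books2": 8, "Wikipedia": 3}

rng = random.Random(42)
draws = rng.choices(list(weights), weights=weights.values(), k=100_000)
counts = Counter(draws)

for source, w in weights.items():
    frac = 100 * counts[source] / len(draws)
    print(f"{source:12s} target {w:>2d}%  sampled {frac:.1f}%")
```

In a real training pipeline the unit drawn would be a document or a packed sequence from the chosen source rather than just its name, but the proportioning mechanism is the same.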
Training Configuration Evolution
| Model | Year | Parameters | Tokens | FLOPs (≈6·N·D) |
|---|---|---|---|---|
| GPT | 2018 | 117M | 5B | ~3.5 × 10¹⁸ |
| BERT-large | 2019 | 340M | 13.7B | ~2.8 × 10¹⁹ |
| GPT-2 | 2019 | 1.5B | 40B | ~3.6 × 10²⁰ |
| GPT-3 | 2020 | 175B | 300B | ~3.15 × 10²³ |
The 6·N·D Rule
A widely used estimate for training FLOPs:
C ≈ 6 · N · D
where N = number of parameters and D = number of training tokens. The factor 6 accounts for the forward pass (≈2 FLOPs per parameter per token) and the backward pass (≈4 FLOPs per parameter per token, covering gradients with respect to both activations and weights). Kaplan et al. (2020) validated this formula across many scales.
For GPT-3: 6 × 175B × 300B = 3.15 × 10²³ FLOPs — consistent with the reported estimate.
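The arithmetic can be checked directly with a one-line helper (parameter and token counts from the tables above):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6·N·D (Kaplan et al., 2020)."""
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters, 300B tokens
c = train_flops(175e9, 300e9)
print(f"{c:.2e}")  # 3.15e+23
```

The same helper reproduces the FLOPs column of the configuration-evolution table when fed each model's N and D.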
Related Pages
See scaling-laws for how pre-training performance scales with N, D, and C; training-data-curation for how raw web text is filtered into usable training data; and next-token-prediction for the causal LM objective in detail.
Sources
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020
- Devlin et al. (2019) — BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019
- Radford et al. (2018) — Improving Language Understanding by Generative Pre-Training (GPT). OpenAI Blog
Frequently Asked Questions
Why is pre-training effective for downstream tasks?
Pre-training on diverse text exposes the model to a vast range of linguistic patterns, factual associations, and reasoning chains. The model develops representations that capture syntax, semantics, and world knowledge. When fine-tuned on a downstream task, only a small labeled dataset is needed to specialize these general representations, rather than learning from scratch. This is the 'pre-train then fine-tune' paradigm that has defined NLP since 2018.
What data is used for pre-training?
Large language models are pre-trained on filtered web text (Common Crawl derivatives), books corpora (Books1, Books2), Wikipedia, code repositories (GitHub), scientific papers, and other high-quality text sources. Brown et al. (2020) found that data quality matters significantly — the mix of data sources and filtering applied to Common Crawl substantially affects downstream performance. Typical pre-training corpora for large models contain 1–15 trillion tokens.
How many training steps does pre-training require?
Pre-training step count ≈ D / B, where D = total training tokens and B = batch size in tokens (equivalently D / (b × L) for b sequences of length L per batch). GPT-3, with 300B tokens at a 3.2M-token batch size: 300B / 3.2M ≈ 93,750 steps. For comparison, BERT used ~1M steps at much smaller batch sizes. Modern large-scale pre-training typically runs 250K–1M optimizer steps, with each step processing millions of tokens.
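The step-count arithmetic above, as a quick check (GPT-3 figures; since the batch size is already given in tokens, no separate sequence-length factor is needed):

```python
def num_steps(total_tokens: float, batch_tokens: float) -> int:
    """Optimizer steps = total training tokens / tokens per batch."""
    return round(total_tokens / batch_tokens)

print(num_steps(300e9, 3.2e6))  # 93750
```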