Masked Language Modeling: BERT's Pre-Training Objective and Bidirectional Context
BERT's MLM selects 15% of tokens for prediction — 80% replaced with [MASK], 10% with a random token, 10% left unchanged — enabling bidirectional context encoding; BERT-large reached a GLUE score of 80.5, a 7.7-point absolute improvement over the prior state of the art (Devlin et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Masking rate | 15% | of tokens | Per input sequence; chosen as balance between too few signal vs. too much corruption |
| Masking strategy breakdown | 80/10/10 | [MASK] / random / unchanged | Random and unchanged tokens prevent the model from learning to predict only at [MASK] positions |
| BERT-large GLUE score | 80.5 | GLUE average | Devlin et al. (2019); +7.7 points over prior best; trained on 3.3B words for ~40 epochs |
| RoBERTa masking improvement | +1.2 | GLUE points | Dynamic masking (new mask per epoch) vs static masking; Liu et al. (2019) |
| BERT training corpus | 3.3 | billion words | BooksCorpus + English Wikipedia; ~40 epochs (1M steps), 90% of steps at seq_len 128, final 10% at 512 |
Masked language modeling (MLM), introduced by Devlin et al. in “BERT: Pre-training of Deep Bidirectional Transformers” (NAACL 2019), trains encoder-only transformers by predicting randomly masked tokens using full bidirectional context. This differs fundamentally from causal language modeling, which only uses left-to-right context.
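Concretely, the cross-entropy loss is computed only at the positions selected for prediction. A minimal pure-Python sketch, assuming unselected positions carry a sentinel label of -100 (a common ignore-index convention, not something the BERT paper itself specifies):

```python
import math

def mlm_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not -100.

    logits: one list of vocabulary scores per token position.
    labels: original token id at selected positions, -100 elsewhere.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == -100:
            continue  # unselected positions are context only, not predicted
        z = max(row)  # shift for numerical stability
        log_norm = z + math.log(sum(math.exp(x - z) for x in row))
        total += log_norm - row[label]  # -log softmax(row)[label]
        count += 1
    return total / count
```

With uniform scores over a 2-token vocabulary, the loss at a single selected position is `log 2`, as expected for a 50/50 guess.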
The BERT Masking Procedure
For each training sequence:
- Randomly select 15% of token positions for prediction
- For each selected position:
- 80% of the time: replace with [MASK] token
- 10% of the time: replace with a random token from the vocabulary
- 10% of the time: keep the original token unchanged
- Train the model to predict the original token at all selected positions using cross-entropy loss
Only selected positions contribute to the loss — the other 85% of tokens are used as context but not predicted.
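The procedure above can be sketched in a few lines of Python. This is an illustrative toy, not BERT's actual data pipeline: `MASK_ID` and `VOCAB_SIZE` are assumed placeholder values for a WordPiece vocabulary, and -100 is an assumed ignore-index convention marking positions that do not contribute to the loss.

```python
import random

MASK_ID = 103       # assumed [MASK] token id
VOCAB_SIZE = 30522  # assumed vocabulary size

def mask_tokens(token_ids, mask_rate=0.15, rng=None):
    """Apply BERT-style 80/10/10 masking; returns (inputs, labels).

    labels[i] holds the original id at selected positions and -100
    elsewhere, so only selected positions enter the loss.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_rate:  # select ~15% of positions
            labels[i] = tok           # model must predict the original
            r = rng.random()
            if r < 0.8:               # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:             # 10%: random vocabulary token
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Note that at the "unchanged" positions the input looks ordinary but the label is still set, which is exactly what forces the model to form useful representations of unmasked tokens.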
Why the 80/10/10 Split?
| Strategy | Benefit | Problem |
|---|---|---|
| 100% [MASK] | Strong learning signal | [MASK] never appears at inference; train-test mismatch |
| 100% unchanged | No train-test mismatch | No masking signal; model doesn’t learn to predict |
| 80/10/10 (BERT) | Strong signal from the 80% [MASK]; random and identity tokens add robustness | Slight residual train-test mismatch |
MLM vs Causal LM Comparison
| Property | MLM (BERT-style) | Causal LM (GPT-style) |
|---|---|---|
| Context direction | Bidirectional | Left-to-right only |
| Primary architecture | Encoder-only | Decoder-only |
| Good for | Classification, extraction, NER | Text generation, completion |
| Pre-training data efficiency | Higher (sees each token from both directions) | Lower |
| Fine-tuning approach | Add classification head; fine-tune all weights | Prompt/few-shot or fine-tune |
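The context-direction row in the table comes down to the attention mask each architecture applies. A minimal sketch, with plain Python lists standing in for attention-mask tensors (1 = may attend, 0 = blocked):

```python
def causal_mask(n):
    """GPT-style mask: position i attends only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style mask: every position attends to every position."""
    return [[1] * n for _ in range(n)]
```

The causal mask is lower-triangular, so each prediction sees only left context; the bidirectional mask is all ones, which is what lets MLM condition on tokens on both sides of a masked position.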
BERT Performance on Downstream Tasks
| Task | Metric | BERT-large | Prior SOTA | Improvement |
|---|---|---|---|---|
| GLUE | Average | 80.5 | 72.8 | +7.7 |
| SQuAD v1.1 | F1 | 93.2 | 91.7 | +1.5 |
| SQuAD v2.0 | F1 | 83.1 | 78.0 | +5.1 |
| MultiNLI | Accuracy | 86.7 | 82.1 | +4.6 |
RoBERTa Improvements Over BERT
Liu et al. (2019) showed that several training choices, none requiring architectural changes, substantially improve on BERT's performance:
| Change | GLUE Improvement |
|---|---|
| Dynamic masking (new mask per epoch) | +1.2 |
| Removing Next Sentence Prediction (NSP) | +0.9 |
| Larger batch size (8K vs 256) | +0.8 |
| More training data (160GB vs 16GB) | +1.4 |
| Longer training | +0.5 |
Related Pages
See next-token-prediction for the causal LM objective comparison, fine-tuning for how MLM-pre-trained models are adapted for downstream tasks, and scaling-laws for how MLM pre-training scales with compute.
Sources
- Devlin et al. (2019) — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019
- Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv
- Baevski et al. (2019) — Cloze-driven Pretraining of Self-attention Networks. EMNLP 2019
Frequently Asked Questions
Why does BERT use a mix of [MASK], random, and unchanged tokens?
If all 15% of selected tokens were always replaced with [MASK], the model would learn to predict tokens only when seeing [MASK] — but [MASK] never appears at inference time. To prevent this train/test mismatch, 10% of selected tokens are replaced with a random word and 10% are left unchanged. The model must learn to predict the original token even when the input appears normal, making representations more robust and usable for downstream tasks without masking.
What is the difference between MLM and causal language modeling?
MLM uses bidirectional context — when predicting a masked token, the model can attend to tokens both before and after the mask. Causal LM (next-token prediction) uses only left context. Bidirectional context makes MLM-trained models (like BERT) better for understanding tasks (classification, question answering, named entity recognition) but unable to generate text autoregressively. Causal LM models are naturally generative but rely on left-to-right context only.
What is dynamic masking and why does it help?
Static masking (original BERT) generates the mask once during data preprocessing, so the model sees the same masked positions repeatedly across epochs. Dynamic masking (RoBERTa, Liu et al. 2019) generates a new random mask for each training instance at each epoch, so the model never sees the same (sequence, mask) pair twice. Liu et al. found this improves GLUE by ~1.2 points and is one of several optimizations in RoBERTa that improved on BERT without architectural changes.