Masked Language Modeling: BERT's Pre-Training Objective and Bidirectional Context
BERT's MLM selects 15% of tokens for prediction — 80% replaced with [MASK], 10% with a random token, 10% left unchanged — enabling bidirectional context encoding; BERT-large reached a GLUE score of 80.5, a 7.7-point absolute improvement over the prior state of the art (Devlin et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Masking rate | 15% | of tokens | Per input sequence; chosen as balance between too few signal vs. too much corruption |
| Masking strategy breakdown | 80/10/10 | [MASK] / random / unchanged | Random and unchanged tokens prevent the model from learning to predict only at [MASK] positions |
| BERT-large GLUE score | 80.5 | GLUE average | Devlin et al. (2019); +7.7 points over prior best; trained on 3.3B words for ~40 epochs |
| RoBERTa masking improvement | +1.2 | GLUE points | Dynamic masking (new mask per epoch) vs static masking; Liu et al. (2019) |
| BERT training corpus | 3.3 | billion words | BooksCorpus + English Wikipedia; ~40 epochs (1M steps), 90% of steps at seq_len 128, final 10% at 512 |
Masked language modeling (MLM), introduced by Devlin et al. in “BERT: Pre-training of Deep Bidirectional Transformers” (NAACL 2019), trains encoder-only transformers by predicting randomly masked tokens using full bidirectional context. This differs fundamentally from causal language modeling, which only uses left-to-right context.
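Concretely, the cross-entropy loss is computed only at the positions selected for prediction. A minimal pure-Python sketch, assuming unselected positions carry a sentinel label of -100 (a common ignore-index convention, not something the BERT paper itself specifies):

```python
import math

def mlm_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not -100.

    logits: one list of vocabulary scores per token position.
    labels: original token id at selected positions, -100 elsewhere.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == -100:
            continue  # unselected positions are context only, not predicted
        z = max(row)  # shift for numerical stability
        log_norm = z + math.log(sum(math.exp(x - z) for x in row))
        total += log_norm - row[label]  # -log softmax(row)[label]
        count += 1
    return total / count
```

With uniform scores over a 2-token vocabulary, the loss at a single selected position is `log 2`, as expected for a 50/50 guess.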
The BERT Masking Procedure
For each training sequence:
- Randomly select 15% of token positions for prediction
- For each selected position:
- 80% of the time: replace with [MASK] token
- 10% of the time: replace with a random token from the vocabulary
- 10% of the time: keep the original token unchanged
- Train the model to predict the original token at all selected positions using cross-entropy loss
Only selected positions contribute to the loss — the other 85% of tokens are used as context but not predicted.
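The procedure above can be sketched in a few lines of Python. This is an illustrative toy, not BERT's actual data pipeline: `MASK_ID` and `VOCAB_SIZE` are assumed placeholder values for a WordPiece vocabulary, and -100 is an assumed ignore-index convention marking positions that do not contribute to the loss.

```python
import random

MASK_ID = 103       # assumed [MASK] token id
VOCAB_SIZE = 30522  # assumed vocabulary size

def mask_tokens(token_ids, mask_rate=0.15, rng=None):
    """Apply BERT-style 80/10/10 masking; returns (inputs, labels).

    labels[i] holds the original id at selected positions and -100
    elsewhere, so only selected positions enter the loss.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_rate:  # select ~15% of positions
            labels[i] = tok           # model must predict the original
            r = rng.random()
            if r < 0.8:               # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:             # 10%: random vocabulary token
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Note that at the "unchanged" positions the input looks ordinary but the label is still set, which is exactly what forces the model to form useful representations of unmasked tokens.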
Why the 80/10/10 Split?
| Strategy | Benefit | Problem |
|---|---|---|
| 100% [MASK] | Strong learning signal | [MASK] never appears at inference; train-test mismatch |
| 100% unchanged | No train-test mismatch | No masking signal; model doesn’t learn to predict |
| 80/10/10 (BERT) | Strong signal from the 80% [MASK]; random and identity tokens add robustness | Slight residual train-test mismatch |
MLM vs Causal LM Comparison
| Property | MLM (BERT-style) | Causal LM (GPT-style) |
|---|---|---|
| Context direction | Bidirectional | Left-to-right only |
| Primary architecture | Encoder-only | Decoder-only |
| Good for | Classification, extraction, NER | Text generation, completion |
| Pre-training data efficiency | Higher (sees each token from both directions) | Lower |
| Fine-tuning approach | Add classification head; fine-tune all weights | Prompt/few-shot or fine-tune |
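The context-direction row in the table comes down to the attention mask each architecture applies. A minimal sketch, with plain Python lists standing in for attention-mask tensors (1 = may attend, 0 = blocked):

```python
def causal_mask(n):
    """GPT-style mask: position i attends only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style mask: every position attends to every position."""
    return [[1] * n for _ in range(n)]
```

The causal mask is lower-triangular, so each prediction sees only left context; the bidirectional mask is all ones, which is what lets MLM condition on tokens on both sides of a masked position.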
BERT Performance on Downstream Tasks
| Task | Metric | BERT-large | Prior SOTA | Improvement |
|---|---|---|---|---|
| GLUE | Average | 80.5 | 72.8 | +7.7 |
| SQuAD v1.1 | F1 | 93.2 | 91.7 | +1.5 |
| SQuAD v2.0 | F1 | 83.1 | 78.0 | +5.1 |
| MultiNLI | Accuracy | 86.7 | 82.1 | +4.6 |
RoBERTa Improvements Over BERT
Liu et al. (2019) showed that several training choices, none requiring architectural changes, substantially improve on BERT's performance:
| Change | GLUE Improvement |
|---|---|
| Dynamic masking (new mask per epoch) | +1.2 |
| Removing Next Sentence Prediction (NSP) | +0.9 |
| Larger batch size (8K vs 256) | +0.8 |
| More training data (160GB vs 16GB) | +1.4 |
| Longer training | +0.5 |
Related Pages
See next-token-prediction for the causal LM objective comparison, fine-tuning for how MLM-pre-trained models are adapted for downstream tasks, and scaling-laws for how MLM pre-training scales with compute.
Sources
- Devlin et al. (2019) — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019
- Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv
- Baevski et al. (2019) — Cloze-driven Pretraining of Self-attention Networks. EMNLP 2019
Frequently Asked Questions
Why does BERT use a mix of [MASK], random, and unchanged tokens?
If all 15% of selected tokens were always replaced with [MASK], the model would learn to predict tokens only when seeing [MASK] — but [MASK] never appears at inference time. To prevent this train/test mismatch, 10% of selected tokens are replaced with a random word and 10% are left unchanged. The model must learn to predict the original token even when the input appears normal, making representations more robust and usable for downstream tasks without masking.
What is the difference between MLM and causal language modeling?
MLM uses bidirectional context — when predicting a masked token, the model can attend to tokens both before and after the mask. Causal LM (next-token prediction) uses only left context. Bidirectional context makes MLM-trained models (like BERT) better for understanding tasks (classification, question answering, named entity recognition) but unable to generate text autoregressively. Causal LM models are naturally generative but rely on left-to-right context only.
What is dynamic masking and why does it help?
Static masking (original BERT) generates the mask once during data preprocessing, so the model sees the same masked positions repeatedly across epochs. Dynamic masking (RoBERTa, Liu et al. 2019) generates a new random mask for each training instance at each epoch, so the model never sees the same (sequence, mask) pair twice. Liu et al. found this improves GLUE by ~1.2 points and is one of several optimizations in RoBERTa that improved on BERT without architectural changes.