Knowledge Distillation: Soft Targets, Temperature Scaling, and Compression Ratios
Knowledge distillation trains a student on a teacher's soft labels at temperature T; DistilBERT retains 97% of BERT's GLUE score at 60% of its size (66M vs 110M parameters), using T=4 and a soft-label loss weight of 0.9 (Sanh et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| DistilBERT size vs BERT | 66M vs 110M | parameters | 40% parameter reduction; 60% of original size |
| DistilBERT GLUE score | 97% | relative to BERT | Sanh et al. (2019); retains 97% of BERT-base performance on GLUE benchmark |
| DistilBERT inference speed | 60% | faster | 60% faster than BERT-base on CPU; 40% smaller memory footprint |
| Distillation temperature (DistilBERT) | 4 | T | T=4 produces softer distributions; temperature is applied during training, with T=1 at inference |
| TinyBERT compression | 7.5× | smaller | 14.5M parameters; retains 96.8% of BERT teacher performance on GLUE |
Knowledge distillation, introduced by Hinton et al. (2015), transfers knowledge from a large, well-trained teacher model to a smaller student model. The key insight is that the teacher’s soft probability outputs contain more information than hard labels — they encode the model’s uncertainty and inter-class relationships.
The Distillation Objective
Standard distillation combines two loss terms:
L_total = α · L_CE(y, σ(z_s)) + (1−α) · L_KL(σ(z_t / T) ‖ σ(z_s / T))
where:
- σ = the softmax function; z_s, z_t = student and teacher logits
- L_CE = cross-entropy with the true labels y (hard targets)
- L_KL = KL divergence between the temperature-softened teacher and student distributions (the temperature divides the logits before softmax, not the probabilities)
- T = temperature used to soften both distributions
- α = weighting coefficient (DistilBERT uses α = 0.1, so the soft-label term gets weight 0.9)
In practice the KL term is usually multiplied by T², since the gradients it produces scale as 1/T² (Hinton et al., 2015).
The soft labels are computed as: p_soft(i) = exp(z_i / T) / Σⱼ exp(z_j / T), where z_i are logits.
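This objective can be sketched in a few lines of plain Python. This is an illustration, not DistilBERT's actual training code: the function names are made up for the example, and the T² factor on the KL term follows Hinton et al.'s gradient-scaling argument.

```python
import math

def softmax(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(z_student, z_teacher, true_class, T=4.0, alpha=0.1):
    """alpha * hard-label CE + (1 - alpha) * T^2 * KL(teacher_soft || student_soft)."""
    hard_ce = -math.log(softmax(z_student)[true_class])   # CE at T=1
    p_t = softmax(z_teacher, T)                           # softened teacher
    q_s = softmax(z_student, T)                           # softened student
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, q_s))
    return alpha * hard_ce + (1 - alpha) * T * T * kl

loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 1.5, 0.5], true_class=0)
```

Note that when the student's logits equal the teacher's, the KL term vanishes and only the weighted hard-label cross-entropy remains.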
Temperature Effect on Soft Distributions
For a sample with logits [3.0, 1.5, 0.5] (typical 3-class example):
| Temperature | Class 1 | Class 2 | Class 3 | Information density |
|---|---|---|---|---|
| T = 1 (standard) | 0.766 | 0.171 | 0.063 | Low (hard peak) |
| T = 2 | 0.569 | 0.269 | 0.163 | Moderate |
| T = 4 | 0.450 | 0.309 | 0.241 | High (smooth) |
| T = 10 | 0.379 | 0.326 | 0.295 | Very high (near-uniform) |
Higher temperatures reveal more about the teacher’s uncertainty structure, providing richer training signal for the student.
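Temperature-scaled soft targets are easy to compute directly; here is a minimal sketch for the logits above (no max-subtraction for numerical stability, which is fine for logits this small):

```python
import math

def soft_targets(logits, T):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 0.5]
for T in (1, 2, 4, 10):
    probs = [round(p, 3) for p in soft_targets(logits, T)]
    print(f"T={T}: {probs}")
```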
Distilled Model Benchmarks
| Model | Parameters | GLUE Score | % of Teacher | Relative speed |
|---|---|---|---|---|
| BERT-base (teacher) | 110M | 79.6 | 100% | 1× |
| DistilBERT | 66M | 77.0 | 97% | 1.6× |
| TinyBERT | 14.5M | 76.5 | 96% | 9.4× |
What Gets Distilled
| Distillation Target | Information Transferred | Complexity |
|---|---|---|
| Output logits | Final class probabilities | Simple |
| Intermediate activations | Hidden state representations | Medium |
| Attention patterns | Attention weight matrices | High |
| Embedding layer | Token embedding geometry | High |
TinyBERT uses all four levels simultaneously, which is why it preserves accuracy (96% of the teacher's GLUE score) at a far higher compression ratio than output-only distillation typically allows.
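As an illustration of the "intermediate activations" row, a TinyBERT-style hidden-state loss projects the student's smaller hidden state into the teacher's dimension and penalizes the mismatch. This is a hedged sketch: in TinyBERT the projection matrix is learned during distillation, whereas here it is random, and the function name is illustrative.

```python
import random

def hidden_state_loss(h_student, h_teacher, W):
    """MSE between the teacher hidden state and the projected student state.

    W is a d_teacher x d_student projection matrix aligning the student's
    smaller hidden size with the teacher's (learned in practice).
    """
    projected = [sum(w * h for w, h in zip(row, h_student)) for row in W]
    return sum((p - t) ** 2 for p, t in zip(projected, h_teacher)) / len(h_teacher)

random.seed(0)
d_student, d_teacher = 4, 8                      # toy dimensions
h_s = [random.gauss(0, 1) for _ in range(d_student)]
h_t = [random.gauss(0, 1) for _ in range(d_teacher)]
W = [[random.gauss(0, 0.5) for _ in range(d_student)] for _ in range(d_teacher)]
loss = hidden_state_loss(h_s, h_t, W)
```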
Related Pages
See quantization for complementary model compression via precision reduction, and lora-fine-tuning for parameter-efficient adaptation methods that achieve similar goals through fine-tuning rather than training a separate student model.
Sources
- Hinton et al. (2015) — Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning Workshop (arXiv:1503.02531)
- Sanh et al. (2019) — DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. NeurIPS 2019 Workshop
- Jiao et al. (2020) — TinyBERT: Distilling BERT for Natural Language Understanding. Findings of EMNLP 2020
Frequently Asked Questions
Why use soft labels from a teacher instead of just training on hard labels?
Hard labels (one-hot vectors) provide training signal only from the correct class. Soft labels from a large teacher encode the teacher's uncertainty distribution across all classes — for instance, a teacher might assign 'automobile' 0.90 probability and 'truck' 0.09 probability. This inter-class similarity information (Hinton et al. call it 'dark knowledge') helps the student generalize better than hard labels alone, particularly with limited data.
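The difference in training signal is easy to see from the softmax cross-entropy gradient, which with respect to the logits is p_student − target. The numbers below are illustrative, in the spirit of the automobile/truck example above:

```python
# Hard target vs. soft teacher target for classes (automobile, truck, bird).
hard_target = [1.0, 0.0, 0.0]
soft_target = [0.90, 0.09, 0.01]   # illustrative teacher probabilities

# Current student probabilities (illustrative).
p_student = [0.70, 0.20, 0.10]

# Softmax cross-entropy gradient w.r.t. the logits: p_student - target.
# The hard label pushes every wrong class toward probability 0, while the
# soft target pulls 'truck' toward 0.09, preserving the teacher's
# inter-class similarity structure.
grad_hard = [p - t for p, t in zip(p_student, hard_target)]
grad_soft = [p - t for p, t in zip(p_student, soft_target)]
```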
What is the role of temperature in knowledge distillation?
Temperature T controls how 'soft' the teacher's probability distribution is. At T=1, the teacher's standard softmax outputs are used. At T=4 (as in DistilBERT), the logits are divided by 4 before softmax, spreading probability more evenly across classes and making the relative sizes of small probabilities more influential during training. A higher temperature emphasizes the relationships between wrong answers, which contain more transferable structural information.
Can distillation work across different architectures?
Yes — the teacher and student need not share the same architecture. TinyBERT (Jiao et al., 2020) distills not only the output logits but also intermediate layer activations, attention patterns, and embedding layers, requiring architecture-specific alignment layers. This 'intermediate distillation' achieves better results than output-only distillation. Cross-architecture distillation is more complex but enables more aggressive compression.