Knowledge Distillation: Soft Targets, Temperature Scaling, and Compression Ratios
Knowledge distillation trains a student on a teacher's soft labels at temperature T; DistilBERT retains 97% of BERT's GLUE score at 60% of its size (66M vs 110M parameters), using T=4 and a soft-label loss weight of 0.9 (Sanh et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| DistilBERT size vs BERT | 66M vs 110M | parameters | 40% parameter reduction; 60% of original size |
| DistilBERT GLUE score | 97% | relative to BERT | Sanh et al. (2019); retains 97% of BERT-base performance on GLUE benchmark |
| DistilBERT inference speed | 60% | faster | 60% faster than BERT-base on CPU; 40% smaller memory footprint |
| Distillation temperature (DistilBERT) | 4 | T | T=4 produces softer distributions; temperature is applied during training, with T=1 at inference |
| TinyBERT compression | 7.5× | smaller | 14.5M parameters; retains 96.8% of BERT teacher performance on GLUE |
Knowledge distillation, introduced by Hinton et al. (2015), transfers knowledge from a large, well-trained teacher model to a smaller student model. The key insight is that the teacher’s soft probability outputs contain more information than hard labels — they encode the model’s uncertainty and inter-class relationships.
The Distillation Objective
Standard distillation combines two loss terms:
L_total = α · L_CE(y, σ(z_s)) + (1−α) · L_KL(σ(z_t / T) ‖ σ(z_s / T))
where:
- σ = the softmax function; z_s, z_t = student and teacher logits
- L_CE = cross-entropy with the true labels y (hard targets)
- L_KL = KL divergence between the temperature-softened teacher and student distributions (the temperature divides the logits before softmax, not the probabilities)
- T = temperature used to soften both distributions
- α = weighting coefficient (DistilBERT uses α = 0.1, so the soft-label term gets weight 0.9)
In practice the KL term is usually multiplied by T², since the gradients it produces scale as 1/T² (Hinton et al., 2015).
The soft labels are computed as: p_soft(i) = exp(z_i / T) / Σⱼ exp(z_j / T), where z_i are logits.
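This objective can be sketched in a few lines of plain Python. This is an illustration, not DistilBERT's actual training code: the function names are made up for the example, and the T² factor on the KL term follows Hinton et al.'s gradient-scaling argument.

```python
import math

def softmax(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(z_student, z_teacher, true_class, T=4.0, alpha=0.1):
    """alpha * hard-label CE + (1 - alpha) * T^2 * KL(teacher_soft || student_soft)."""
    hard_ce = -math.log(softmax(z_student)[true_class])   # CE at T=1
    p_t = softmax(z_teacher, T)                           # softened teacher
    q_s = softmax(z_student, T)                           # softened student
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, q_s))
    return alpha * hard_ce + (1 - alpha) * T * T * kl

loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 1.5, 0.5], true_class=0)
```

Note that when the student's logits equal the teacher's, the KL term vanishes and only the weighted hard-label cross-entropy remains.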
Temperature Effect on Soft Distributions
For a sample with logits [3.0, 1.5, 0.5] (typical 3-class example):
| Temperature | Class 1 | Class 2 | Class 3 | Information density |
|---|---|---|---|---|
| T = 1 (standard) | 0.766 | 0.171 | 0.063 | Low (hard peak) |
| T = 2 | 0.569 | 0.269 | 0.163 | Moderate |
| T = 4 | 0.450 | 0.309 | 0.241 | High (smooth) |
| T = 10 | 0.379 | 0.326 | 0.295 | Very high (near-uniform) |
Higher temperatures reveal more about the teacher’s uncertainty structure, providing richer training signal for the student.
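Temperature-scaled soft targets are easy to compute directly; here is a minimal sketch for the logits above (no max-subtraction for numerical stability, which is fine for logits this small):

```python
import math

def soft_targets(logits, T):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 0.5]
for T in (1, 2, 4, 10):
    probs = [round(p, 3) for p in soft_targets(logits, T)]
    print(f"T={T}: {probs}")
```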
Distilled Model Benchmarks
| Model | Parameters | GLUE Score | % of Teacher | Relative speed |
|---|---|---|---|---|
| BERT-base (teacher) | 110M | 79.6 | 100% | 1× |
| DistilBERT | 66M | 77.0 | 97% | 1.6× |
| TinyBERT | 14.5M | 76.5 | 96% | 9.4× |
What Gets Distilled
| Distillation Target | Information Transferred | Complexity |
|---|---|---|
| Output logits | Final class probabilities | Simple |
| Intermediate activations | Hidden state representations | Medium |
| Attention patterns | Attention weight matrices | High |
| Embedding layer | Token embedding geometry | High |
TinyBERT uses all four levels simultaneously, which is why it preserves accuracy (96% of the teacher's GLUE score) at a far higher compression ratio than output-only distillation typically allows.
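As an illustration of the "intermediate activations" row, a TinyBERT-style hidden-state loss projects the student's smaller hidden state into the teacher's dimension and penalizes the mismatch. This is a hedged sketch: in TinyBERT the projection matrix is learned during distillation, whereas here it is random, and the function name is illustrative.

```python
import random

def hidden_state_loss(h_student, h_teacher, W):
    """MSE between the teacher hidden state and the projected student state.

    W is a d_teacher x d_student projection matrix aligning the student's
    smaller hidden size with the teacher's (learned in practice).
    """
    projected = [sum(w * h for w, h in zip(row, h_student)) for row in W]
    return sum((p - t) ** 2 for p, t in zip(projected, h_teacher)) / len(h_teacher)

random.seed(0)
d_student, d_teacher = 4, 8                      # toy dimensions
h_s = [random.gauss(0, 1) for _ in range(d_student)]
h_t = [random.gauss(0, 1) for _ in range(d_teacher)]
W = [[random.gauss(0, 0.5) for _ in range(d_student)] for _ in range(d_teacher)]
loss = hidden_state_loss(h_s, h_t, W)
```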
Related Pages
See quantization for complementary model compression via precision reduction, and lora-fine-tuning for parameter-efficient adaptation methods that achieve similar goals through fine-tuning rather than training a separate student model.
Sources
- Hinton et al. (2015) — Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning Workshop (arXiv:1503.02531)
- Sanh et al. (2019) — DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. NeurIPS 2019 Workshop
- Jiao et al. (2020) — TinyBERT: Distilling BERT for Natural Language Understanding. Findings of EMNLP 2020
Frequently Asked Questions
Why use soft labels from a teacher instead of just training on hard labels?
Hard labels (one-hot vectors) provide training signal only from the correct class. Soft labels from a large teacher encode the teacher's uncertainty distribution across all classes — for instance, a teacher might assign 'automobile' 0.90 probability and 'truck' 0.09 probability. This inter-class similarity information (Hinton et al. call it 'dark knowledge') helps the student generalize better than hard labels alone, particularly with limited data.
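The difference in training signal is easy to see from the softmax cross-entropy gradient, which with respect to the logits is p_student − target. The numbers below are illustrative, in the spirit of the automobile/truck example above:

```python
# Hard target vs. soft teacher target for classes (automobile, truck, bird).
hard_target = [1.0, 0.0, 0.0]
soft_target = [0.90, 0.09, 0.01]   # illustrative teacher probabilities

# Current student probabilities (illustrative).
p_student = [0.70, 0.20, 0.10]

# Softmax cross-entropy gradient w.r.t. the logits: p_student - target.
# The hard label pushes every wrong class toward probability 0, while the
# soft target pulls 'truck' toward 0.09, preserving the teacher's
# inter-class similarity structure.
grad_hard = [p - t for p, t in zip(p_student, hard_target)]
grad_soft = [p - t for p, t in zip(p_student, soft_target)]
```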
What is the role of temperature in knowledge distillation?
Temperature T controls how 'soft' the teacher's probability distribution is. At T=1, the teacher's standard softmax outputs are used. At T=4 (as in DistilBERT), the logits are divided by 4 before softmax, spreading probability more evenly across classes and making the relative sizes of small probabilities more influential during training. A higher temperature emphasizes the relationships between wrong answers, which contain more transferable structural information.
Can distillation work across different architectures?
Yes — the teacher and student need not share the same architecture. TinyBERT (Jiao et al., 2020) distills not only the output logits but also intermediate layer activations, attention patterns, and embedding layers, requiring architecture-specific alignment layers. This 'intermediate distillation' achieves better results than output-only distillation. Cross-architecture distillation is more complex but enables more aggressive compression.