Knowledge Distillation: Soft Targets, Temperature Scaling, and Compression Ratios

Category: representation · Updated: 2026-02-27

Knowledge distillation trains a student on teacher soft labels at temperature T; DistilBERT achieves 97% of BERT's GLUE score at 60% of its size (66M vs 110M parameters), using T=4 and a distillation loss coefficient of 0.9 (Sanh et al., 2019).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| DistilBERT size vs BERT | 66M vs 110M | parameters | 40% parameter reduction; 60% of original size |
| DistilBERT GLUE score | 97% | relative to BERT | Sanh et al. (2019); retains 97% of BERT-base performance on the GLUE benchmark |
| DistilBERT inference speed | 60% | faster | 60% faster than BERT-base on CPU; 40% smaller memory footprint |
| Distillation temperature (DistilBERT) | 4 | T | T=4 produces softer distributions; used during training, with T typically reset to 1 at inference |
| TinyBERT compression | 7.5× | smaller | 14.5M parameters; retains 96.8% of BERT teacher performance on GLUE |

Knowledge distillation, introduced by Hinton et al. (2015), transfers knowledge from a large, well-trained teacher model to a smaller student model. The key insight is that the teacher’s soft probability outputs contain more information than hard labels — they encode the model’s uncertainty and inter-class relationships.

The Distillation Objective

Standard distillation combines two loss terms:

L_total = α · L_CE(y, σ(z_student)) + (1−α) · T² · L_KL(σ(z_teacher/T) ‖ σ(z_student/T))

where:

  • σ = softmax; z_teacher, z_student are the raw logits (temperature divides logits, not probabilities)
  • L_CE = cross-entropy with true labels (hard targets)
  • L_KL = KL divergence between teacher and student soft distributions
  • T = temperature for soft label computation; the T² factor keeps soft-target gradients on the same scale as the hard-label term (Hinton et al., 2015)
  • α = weighting coefficient (DistilBERT uses α = 0.1, 1−α = 0.9)

The soft labels are computed as: p_soft(i) = exp(z_i / T) / Σⱼ exp(z_j / T), where z_i are logits.
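The combined objective can be sketched in a few lines of NumPy. This is illustrative only (real training uses a framework's autograd); the T² factor on the soft-loss term follows Hinton et al. (2015), and the default T=4, α=0.1 mirror the DistilBERT settings above:

```python
import numpy as np

def softmax(z, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), with max-subtraction for stability."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(z_student, z_teacher, y_true, T=4.0, alpha=0.1):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    ce = -np.log(softmax(z_student)[y_true])      # hard-label term, evaluated at T=1
    p_t = softmax(z_teacher, T)                   # teacher soft targets
    p_s = softmax(z_student, T)                   # student soft predictions
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
    return alpha * ce + (1 - alpha) * T**2 * kl   # T^2 rescales soft-loss gradients

# When the student's logits match the teacher's, the KL term vanishes and only
# the hard-label cross-entropy remains:
z = [3.0, 1.5, 0.5]
loss = distillation_loss(z, z, y_true=0)
```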

Temperature Effect on Soft Distributions

For a sample with logits [3.0, 1.5, 0.5] (typical 3-class example):

| Temperature | Class 1 | Class 2 | Class 3 | Information density |
|---|---|---|---|---|
| T = 1 (standard) | 0.766 | 0.171 | 0.063 | Low (hard peak) |
| T = 2 | 0.569 | 0.269 | 0.163 | Moderate |
| T = 4 | 0.450 | 0.309 | 0.241 | High (smooth) |
| T = 10 | 0.379 | 0.326 | 0.295 | Very high (near-uniform) |

Higher temperatures reveal more about the teacher’s uncertainty structure, providing richer training signal for the student.
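The temperature-scaled softmax for these logits is quick to compute directly (a minimal check using the same 3-class logit vector):

```python
import numpy as np

def soft_targets(logits, T):
    """Temperature-scaled softmax over a logit vector."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())           # max-subtraction for numerical stability
    return e / e.sum()

logits = [3.0, 1.5, 0.5]
for T in (1, 2, 4, 10):
    # Prints the three class probabilities for each temperature; as T grows,
    # the distribution flattens toward uniform (1/3 each).
    print(f"T={T:>2}:", np.round(soft_targets(logits, T), 3))
```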

Distilled Model Benchmarks

| Model | Parameters | GLUE Score | % of Teacher | Speed |
|---|---|---|---|---|
| BERT-base (teacher) | 110M | 79.6 | 100% | 1.0× (baseline) |
| DistilBERT | 66M | 77.0 | 97% | 1.6× faster |
| TinyBERT | 14.5M | 76.5 | 96% | 9.4× faster |

What Gets Distilled

| Distillation Target | Information Transferred | Complexity |
|---|---|---|
| Output logits | Final class probabilities | Simple |
| Intermediate activations | Hidden state representations | Medium |
| Attention patterns | Attention weight matrices | High |
| Embedding layer | Token embedding geometry | High |

TinyBERT uses all four levels simultaneously, which is why it retains more of the teacher's accuracy at a far higher compression ratio than output-only distillation.
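The intermediate-activation level can be sketched as an MSE between projected student hidden states and teacher hidden states. The hidden sizes below mirror TinyBERT's 312-dim student against BERT's 768-dim teacher; the projection W is a learned alignment layer in practice, random here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_student, d_teacher = 8, 312, 768   # TinyBERT-style hidden sizes
h_student = rng.normal(size=(seq_len, d_student))
h_teacher = rng.normal(size=(seq_len, d_teacher))

# Learned in real training; random here. Projects student states into the
# teacher's space so the two representations are comparable despite widths.
W = rng.normal(scale=0.02, size=(d_student, d_teacher))

def hidden_state_loss(h_s, h_t, W):
    """MSE between projected student hidden states and teacher hidden states."""
    return float(np.mean((h_s @ W - h_t) ** 2))

loss = hidden_state_loss(h_student, h_teacher, W)
```

During training this loss is summed over mapped layer pairs and minimized jointly with the output-level distillation loss.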

See quantization for complementary model compression via precision reduction, and lora-fine-tuning for parameter-efficient adaptation methods that achieve similar goals through fine-tuning rather than training from scratch.

Frequently Asked Questions

Why use soft labels from a teacher instead of just training on hard labels?

Hard labels (one-hot vectors) provide training signal only from the correct class. Soft labels from a large teacher encode the teacher's uncertainty distribution across all classes — for instance, a teacher might assign 'automobile' 0.90 probability and 'truck' 0.09 probability. This inter-class similarity information (Hinton et al. call it 'dark knowledge') helps the student generalize better than hard labels alone, particularly with limited data.

What is the role of temperature in knowledge distillation?

Temperature T controls how 'soft' the teacher's probability distribution is. At T=1, the teacher's standard softmax outputs are used. At T=4 (as in DistilBERT), the logits are divided by 4 before softmax, spreading probability more evenly across classes and making the relative sizes of small probabilities more influential during training. A higher temperature emphasizes the relationships between wrong answers, which contain more transferable structural information.

Can distillation work across different architectures?

Yes — the teacher and student need not share the same architecture. TinyBERT (Jiao et al., 2020) distills not only the output logits but also intermediate layer activations, attention patterns, and embedding layers, requiring architecture-specific alignment layers. This 'intermediate distillation' achieves better results than output-only distillation. Cross-architecture distillation is more complex but enables more aggressive compression.
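The attention-pattern term mentioned above is commonly an MSE between teacher and student attention matrices, averaged over heads (TinyBERT applies it to pre-softmax attention scores; some variants use the post-softmax distributions). A toy sketch with matched head counts and random tensors standing in for real attention scores:

```python
import numpy as np

rng = np.random.default_rng(1)

heads, seq_len = 4, 8                                   # toy dimensions
scores_teacher = rng.normal(size=(heads, seq_len, seq_len))
scores_student = rng.normal(size=(heads, seq_len, seq_len))

def attention_loss(a_s, a_t):
    """MSE between student and teacher attention score matrices,
    averaged over heads and positions."""
    return float(np.mean((a_s - a_t) ** 2))

loss = attention_loss(scores_student, scores_teacher)
```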
