RLHF: Reinforcement Learning from Human Feedback — Reward Model and PPO
RLHF trains a reward model on human pairwise preferences, then optimizes via PPO with KL penalty: R = r_θ(x,y) − β·KL(π_RL || π_SFT); introduced for language models by Stiennon et al. (NeurIPS 2020), extended by InstructGPT (Ouyang et al., 2022).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| RLHF reward function | R(x,y) = r_θ(x,y) − β·KL(π_RL(y|x) || π_SFT(y|x)) | — | r_θ = learned reward model; β = KL penalty coefficient; the KL term penalizes divergence from the SFT baseline |
| KL penalty coefficient (β) | 0.01–0.1 | dimensionless | Typical range; higher β is more conservative, lower β permits more aggressive reward optimization |
| InstructGPT human evaluation | 85% | % preference vs baseline | Ouyang et al. (2022): labelers preferred InstructGPT-1.3B over GPT-3 175B outputs 85% of the time |
| Reward model training size | ~33,000 | comparison pairs | InstructGPT: 33K human pairwise comparisons used to train reward model |
| SFT warmup dataset | ~13,000 | labeled prompts | InstructGPT: supervised fine-tuning on 13K high-quality human-written demonstrations first |
Reinforcement Learning from Human Feedback (RLHF) is a three-phase training procedure that aligns language model outputs with human preferences using pairwise comparison data and policy gradient optimization. Introduced by Stiennon et al. (2020) for summarization and extended to instruction-following by Ouyang et al. (2022), it has become the dominant method for producing helpful, harmless AI assistants.
The Three Phases
Phase 1: Supervised Fine-Tuning (SFT)
Fine-tune the base pre-trained model on a curated dataset of (prompt, demonstration) pairs:
- Labelers write high-quality responses to sampled prompts
- Standard cross-entropy training; typically 1–3 epochs
- Produces π_SFT: a starting policy for RL optimization
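The SFT loss is standard next-token cross-entropy over the demonstration tokens. A minimal pure-Python sketch (the function name and toy list-based interface are illustrative, not from the paper):

```python
import math

def sft_cross_entropy(logits, target_ids):
    """Average next-token cross-entropy over a response.

    logits: one logit vector per response position (toy list-of-lists).
    target_ids: the human-written token id at each position.
    """
    total = 0.0
    for logit_vec, target in zip(logits, target_ids):
        # -log softmax(logits)[target] = log(sum of exp(logits)) - logits[target]
        log_z = math.log(sum(math.exp(l) for l in logit_vec))
        total += log_z - logit_vec[target]
    return total / len(target_ids)

# A uniform 2-way distribution costs log 2 per token.
print(sft_cross_entropy([[0.0, 0.0]], [0]))  # ≈ 0.6931
```

In a real training run this is computed by the framework's cross-entropy op over the full vocabulary, typically masking out prompt tokens so only the response contributes to the loss.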
Phase 2: Reward Model Training
Collect human preference comparisons: given prompt x and two responses (y_A, y_B), label which is better.
Reward model objective: maximize the likelihood of the observed preference under the Bradley–Terry model, P(y_A ≻ y_B | x) = σ(r_θ(x, y_A) − r_θ(x, y_B))
| InstructGPT Data | Count |
|---|---|
| SFT demonstrations | ~13,000 |
| Comparison pairs | ~33,000 |
| Total prompts | ~40,000 |
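The pairwise objective above is trained as a negative log-likelihood. A sketch, with scalar inputs standing in for the reward model's outputs (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_preferred, r_rejected):
    """Negative log-likelihood of the pairwise preference:
    -log sigma(r_theta(x, y_A) - r_theta(x, y_B)),
    where y_A is the human-preferred response.
    """
    return -math.log(sigmoid(r_preferred - r_rejected))

# Equal rewards: the model is maximally uncertain, loss = log 2.
# A larger margin in favor of the preferred response lowers the loss.
```

Minimizing this loss pushes r_θ to score the preferred response higher; only reward differences matter, so the reward scale is determined up to an additive constant.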
Phase 3: RL Fine-Tuning with PPO
Optimize π_RL to maximize the augmented reward:
R(x, y) = r_θ(x, y) − β · KL(π_RL(y|x) || π_SFT(y|x))
PPO updates the policy using clipped objectives to prevent large, destabilizing policy updates.
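In practice the KL term is often estimated per sample from the token log-probabilities of the two policies. A sketch of the augmented reward (β = 0.05 is an illustrative value within the 0.01–0.1 range quoted above):

```python
def kl_penalized_reward(rm_score, logprobs_rl, logprobs_sft, beta=0.05):
    """R(x, y) = r_theta(x, y) - beta * sum_t [log pi_RL - log pi_SFT],
    the common per-sample estimate of the KL penalty over response tokens.
    beta=0.05 is illustrative, within the 0.01-0.1 range quoted above.
    """
    kl_estimate = sum(lp_rl - lp_sft
                      for lp_rl, lp_sft in zip(logprobs_rl, logprobs_sft))
    return rm_score - beta * kl_estimate
```

When π_RL has not drifted from π_SFT the penalty vanishes and the policy receives the raw reward-model score; as it drifts, the penalty grows and counteracts further reward optimization.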
PPO Clip Objective
The PPO loss clips the probability ratio to prevent large updates:
L_CLIP(θ) = E[min(r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t)]
where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and ε = 0.2 (typical clipping range).
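The clipped objective can be sketched per token as follows (a toy scalar version; real implementations vectorize this over a batch):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token PPO objective: min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t).
    ratio = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t); the objective is
    maximized, and clipping removes the incentive to push the ratio
    beyond 1 +/- eps in the direction the advantage favors.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, increases in the ratio past 1+ε yield no extra objective; with a negative advantage, the min keeps the unclipped (more pessimistic) term, so the penalty for bad actions is never understated.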
InstructGPT Results (Ouyang et al., 2022)
| Evaluation | InstructGPT 1.3B | GPT-3 175B | Winner |
|---|---|---|---|
| Human preference | 85% | 15% | InstructGPT (preferred 85% of the time) |
| Truthfulness (TruthfulQA) | 41% | 22% | InstructGPT (+19 points) |
| Toxicity (RealToxicityPrompts) | ~25% reduction | — | InstructGPT |
| NLP benchmark performance | Slight regression | — | GPT-3 (RLHF hurts slightly) |
The human preference result, in which the 1.3B-parameter RLHF model is preferred over the 175B-parameter GPT-3, demonstrates that alignment training is highly efficient: a model more than 100× smaller, with better training, can be more useful in practice.
Related Pages
See constitutional-ai for a feedback-reduction approach to alignment, reinforcement-learning-basics for the RL foundations, and alignment-problem for the broader context of why alignment is difficult.
Sources
- Stiennon et al. (2020) — Learning to Summarize with Human Feedback. NeurIPS 2020
- Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022
- Schulman et al. (2017) — Proximal Policy Optimization Algorithms. arXiv
Frequently Asked Questions
What are the three phases of RLHF training?
Phase 1 (SFT): fine-tune the pre-trained language model on a dataset of human-written demonstrations (prompt, response pairs) using standard supervised learning. Phase 2 (Reward model): collect human pairwise comparisons (which response is better?) and train a classifier to predict human preferences. Phase 3 (RL fine-tuning): use PPO to optimize the SFT model to maximize the reward model's score, with a KL divergence penalty to prevent the policy from collapsing to reward-hacking behaviors.
Why is a KL penalty needed in RLHF?
Without the KL penalty, the RL policy can 'reward hack' — finding inputs that fool the reward model into giving high scores without actually being helpful or truthful. The reward model is an imperfect proxy for human preferences and has exploitable weaknesses. The penalty R = r_θ(x,y) − β·KL(π_RL || π_SFT) keeps the optimized policy close to the supervised baseline, limiting how aggressively it can exploit reward model flaws. This is a direct application of Goodhart's Law: when the measure becomes a target, it ceases to be a good measure.
What did Stiennon et al. (2020) demonstrate about RLHF for summarization?
Stiennon et al. trained a reward model on ~64,000 human preference comparisons between TL;DR summaries, then optimized a GPT-3-based summarizer using PPO. The RLHF-optimized model was preferred by human evaluators 65–75% of the time over supervised fine-tuning baselines. This paper established that RLHF could significantly improve human-perceived quality beyond what standard supervised learning achieves.