Reinforcement Learning Basics: MDPs, Policy Gradients, and PPO
PPO (Schulman et al., 2017): its clipped surrogate objective prevents destructive policy updates and achieves better sample efficiency than TRPO with a far simpler implementation; PPO is the standard RL optimizer in RLHF pipelines.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| PPO clip parameter ε | 0.2 | — | Schulman et al. default; clips the policy ratio r(θ) to [1−ε, 1+ε] to prevent large updates |
| REINFORCE variance reduction | Baseline subtraction | — | Subtracting a state-dependent baseline (value function) reduces gradient variance without adding bias |
| Discount factor γ in language RL | 1.0 | — | RLHF pipelines typically use γ=1 since reward arrives only at the end of the sequence (episodic) |
| PPO rollout buffer size | 2048–8192 | tokens/steps | Typical RLHF implementations collect this many response tokens before each gradient update |
| KL penalty coefficient β | 0.01–0.1 | — | β scales the KL divergence from the reference policy in the RLHF reward: R = r_φ − β·KL |
Reinforcement learning (RL) provides the mathematical framework for training agents to maximize reward through interaction with an environment. For language model alignment, RL — specifically policy gradient methods and Proximal Policy Optimization (PPO) — is the optimization method used in RLHF to maximize human preference scores.
The MDP Framework for Language Models
In the language model setting, the MDP elements map to:
| MDP Concept | Language Model Equivalent |
|---|---|
| State s | Token sequence generated so far |
| Action a | Next token to generate (vocab size ~50K) |
| Transition P(s'|s,a) | Deterministic: append a to s |
| Reward R(s,a) | Preference score at end of response |
| Policy π(a|s) | Language model (softmax output) |
| Episode | One complete prompt-response pair |
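The deterministic transition in this mapping can be made concrete with a toy sketch (the tiny vocabulary and EOS token id here are illustrative, not from any real tokenizer):

```python
# Sketch of the language-model MDP: the transition is deterministic,
# since taking action `a` (a token id) in state `s` (the sequence so far)
# simply appends the token.
EOS = 3  # hypothetical end-of-sequence token id

def transition(state, action):
    """P(s'|s,a) is a point mass: s' = s + (a,)."""
    return state + (action,)

def is_terminal(state):
    """An episode ends when EOS is generated."""
    return len(state) > 0 and state[-1] == EOS

s = ("<prompt>",)       # state: the prompt plus tokens generated so far
s = transition(s, 1)    # generate token 1
s = transition(s, EOS)  # generate EOS -> terminal state
assert is_terminal(s)
```

Because the environment dynamics are trivial, all of the learning difficulty in language-model RL sits in the reward and the policy, not in modeling transitions.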
Policy Gradient Theorem
The policy gradient theorem (Sutton & Barto, 2018) provides an unbiased gradient estimate for the expected return:
∇_θ J(θ) = E_π[G_t · ∇_θ log π_θ(a_t|s_t)]
Where G_t = Σ_{k=t}^{T} γ^{k-t} R_k is the discounted return from time t. This is the REINFORCE algorithm (Williams, 1992). The key insight: we can estimate this expectation by sampling trajectories from the current policy and computing the log-probability gradient.
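As a sketch of the estimator, the following samples one-step episodes from a toy 3-action softmax policy and averages G · ∇_θ log π_θ(a); the per-action rewards are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Toy policy: one softmax over 3 actions, parameterized directly by logits.
theta = np.zeros(3)

def sample_episode(theta):
    """One-step episode: sample an action, receive its (hypothetical) reward."""
    p = softmax(theta)
    a = rng.choice(3, p=p)
    return a, [1.0, 0.0, -1.0][a]

def reinforce_grad(theta, n_samples=5000):
    """Monte Carlo estimate of E[G * grad_theta log pi_theta(a)]."""
    grad = np.zeros_like(theta)
    p = softmax(theta)
    for _ in range(n_samples):
        a, G = sample_episode(theta)
        grad_logp = -p.copy()   # grad of log-softmax: one-hot(a) - p
        grad_logp[a] += 1.0
        grad += G * grad_logp
    return grad / n_samples

g = reinforce_grad(theta)
# The estimated gradient pushes probability toward action 0 (reward +1)
# and away from action 2 (reward -1).
assert g[0] > 0 and g[2] < 0
```

Ascending this gradient increases the logit of high-reward actions; the variance issue the next section addresses is visible here in that thousands of samples are needed for a stable estimate.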
Advantage Estimation
Raw return G_t has high variance. The advantage function subtracts a state-value baseline:
A(s_t, a_t) = G_t - V(s_t)
Where V(s_t) = E_π[G_t | s_t] is the expected return from state s_t (estimated by a learned value network). This reduces variance without introducing bias. In practice, Generalized Advantage Estimation (GAE, Schulman et al.) is used:
A_t^{GAE} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = R_t + γV(s_{t+1}) − V(s_t)
With λ ∈ [0, 1] trading off bias against variance: λ=1 recovers the Monte Carlo advantage G_t − V(s_t), while λ=0 gives the one-step TD residual.
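A minimal GAE implementation following the δ recursion above (the reward layout and value estimates are illustrative; with γ=λ=1 the result reduces to the Monte Carlo advantage G_t − V(s_t)):

```python
# Generalized Advantage Estimation via a backward recursion over TD residuals.
# rewards[t] = R_t; values[t] = V(s_t), with one extra entry for the
# terminal bootstrap V(s_T) (0 for an episodic task).
def gae(rewards, values, gamma=1.0, lam=0.95):
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Terminal-reward episode, as in RLHF: reward only on the last token.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.7, 0.0]  # V(s_0..s_2) plus terminal bootstrap 0
adv = gae(rewards, values, gamma=1.0, lam=1.0)
# With gamma = lam = 1 this equals G_t - V(s_t) = 1 - V(s_t) at every step.
```

Lowering λ below 1 shrinks the influence of distant, noisy rewards at the cost of bias from the imperfect value function.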
PPO: Proximal Policy Optimization
PPO (Schulman et al., 2017) constrains policy updates to prevent destructive large steps. The clipped surrogate objective:
L^{CLIP}(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]
Where r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the probability ratio between new and old policy.
| r_t(θ) value | A_t > 0 (good action) | A_t < 0 (bad action) |
|---|---|---|
| r_t = 1.0 | Gradient applies | Gradient applies |
| r_t = 1.3 (ε=0.2) | Clipped at 1.2 (no further gradient) | Not clipped (gradient applies) |
| r_t = 0.7 (ε=0.2) | Not clipped (gradient applies) | Clipped at 0.8 (no further gradient) |
The clip prevents the policy from deviating too far from the old policy in a single update step.
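The clipping behavior can be checked numerically; the following is a minimal sketch of the clipped surrogate written as a loss to minimize (inputs are illustrative, not a full PPO training step):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize)."""
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # min(...) keeps the pessimistic (lower) bound on the objective.
    return -np.mean(np.minimum(unclipped, clipped))

# Two samples with ratio 1.3: A > 0 is clipped to 1.2*A (gradient zeroed),
# A < 0 is left unclipped (the policy is still penalized for the bad action).
adv = np.array([1.0, -1.0])
logp_old = np.zeros(2)
logp_new = np.log(np.array([1.3, 1.3]))
loss = ppo_clip_loss(logp_new, logp_old, adv)
# surrogate terms: min(1.3, 1.2) = 1.2 and min(-1.3, -1.2) = -1.3
```

Note the asymmetry: clipping removes the *incentive* to push the ratio further out of [1−ε, 1+ε], but never blocks gradients that pull the ratio back toward 1.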
RLHF-Specific Reward Formulation
In RLHF, the reward combines the preference score with a KL penalty:
R_total(x, y) = r_φ(x, y) − β · KL[π_θ(·|x) || π_ref(·|x)]
Where:
- r_φ(x, y) is the learned reward model score
- β controls the strength of the KL penalty
- π_ref is the supervised fine-tuned reference policy
- KL divergence penalizes deviation from the reference policy
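A sketch of this combined reward, assuming token-level log-probabilities from the policy and reference model are available (the numbers below are illustrative). The KL term is approximated per token by log π_θ − log π_ref and summed over the response, as is common in RLHF implementations:

```python
import math

def rlhf_reward(reward_model_score, logp_policy, logp_ref, beta=0.1):
    """R_total = r_phi(x, y) - beta * KL-estimate over the response tokens."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward_model_score - beta * kl_estimate

# The policy assigns each token twice the reference probability, so the
# KL estimate is positive and the reward-model score gets penalized.
logp_policy = [math.log(0.5), math.log(0.4)]
logp_ref = [math.log(0.25), math.log(0.2)]
r = rlhf_reward(1.0, logp_policy, logp_ref, beta=0.1)
assert r < 1.0
```

Raising β shrinks the net reward for any drift from π_ref, which is exactly the trade-off the table below summarizes.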
| β value | Effect |
|---|---|
| β = 0 | No KL constraint; the policy is free to drift and reward-hack |
| β = 0.01 | Weak regularization; permits large policy changes |
| β = 0.1 | Common default; balances reward gain against policy stability |
| β = 1.0 | Strong constraint; the policy stays close to the reference, limiting reward gains |
Related Pages
See rlhf for how PPO is applied in the full RLHF pipeline, gradient-descent for the underlying optimization methods, and alignment-problem for why RL is needed for alignment rather than supervised methods alone.
Sources
- Schulman et al. (2017) — Proximal Policy Optimization Algorithms. arXiv
- Sutton & Barto (2018) — Reinforcement Learning: An Introduction. MIT Press
- Williams (1992) — Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning
Frequently Asked Questions
What is the Markov Decision Process formalism?
An MDP is a tuple (S, A, P, R, γ) where S is a state space, A an action space, P(s'|s,a) a transition function, R(s,a) a reward function, and γ ∈ [0,1] a discount factor. An agent observes state s, takes action a, receives reward r, transitions to s', and repeats. The goal is to find a policy π(a|s) that maximizes expected discounted return E[Σ γ^t R_t]. In language model RL: states are token sequences so far, actions are next tokens, reward is the preference score from the reward model.
Why does vanilla policy gradient have high variance and how does PPO fix it?
The REINFORCE gradient estimator ∇J(θ) = E[G_t · ∇log π(a|s)] is unbiased but has high variance because G_t (return) can be large and noisy. PPO addresses this with: (1) advantage estimation using a learned value function V(s) as a baseline (A = G_t - V(s_t)); (2) clipped surrogate objective that bounds the policy update ratio r(θ) = π_θ(a|s)/π_old(a|s) to [1-ε, 1+ε]; (3) multiple gradient steps per rollout batch with early stopping. The clipping prevents catastrophically large policy updates that destabilize training.
What is the credit assignment problem in RL for language models?
In RLHF, a reward signal (human preference score) is given for an entire generated response (often 50–500 tokens). The credit assignment problem asks: which tokens in the response caused the high/low reward? With γ=1 and terminal reward, all tokens in the sequence receive the same return, making it difficult to identify which specific word choices were good or bad. This is why RLHF training is less sample-efficient than supervised learning — the reward signal is sparse and temporally delayed.