Reinforcement Learning Basics: MDPs, Policy Gradients, and PPO
PPO (Schulman et al., 2017): its clipped surrogate objective prevents destructive policy updates and achieves better sample efficiency than TRPO with a far simpler implementation; PPO is the standard RL optimizer in RLHF pipelines.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| PPO clip parameter ε | 0.2 | — | Schulman et al. default; clips the policy ratio r(θ) to [1−ε, 1+ε] to prevent large updates |
| REINFORCE variance reduction | Baseline subtraction | — | Subtracting a state-dependent baseline (value function) reduces gradient variance without adding bias |
| Discount factor γ in language RL | 1.0 | — | RLHF pipelines typically use γ=1 since reward arrives only at the end of the sequence (episodic) |
| PPO rollout buffer size | 2048–8192 | tokens/steps | Typical RLHF implementations collect this many response tokens before each gradient update |
| KL penalty coefficient β | 0.01–0.1 | — | β scales the KL divergence from the reference policy in the RLHF reward: R = r_φ − β·KL |
Reinforcement learning (RL) provides the mathematical framework for training agents to maximize reward through interaction with an environment. For language model alignment, RL — specifically policy gradient methods and Proximal Policy Optimization (PPO) — is the optimization method used in RLHF to maximize human preference scores.
The MDP Framework for Language Models
In the language model setting, the MDP elements map to:
| MDP Concept | Language Model Equivalent |
|---|---|
| State s | Token sequence generated so far |
| Action a | Next token to generate (vocab size ~50K) |
| Transition P(s'|s,a) | Deterministic: append a to s |
| Reward R(s,a) | Preference score at end of response |
| Policy π(a|s) | Language model (softmax output) |
| Episode | One complete prompt-response pair |
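The deterministic transition in this mapping can be made concrete with a toy sketch (the tiny vocabulary and EOS token id here are illustrative, not from any real tokenizer):

```python
# Sketch of the language-model MDP: the transition is deterministic,
# since taking action `a` (a token id) in state `s` (the sequence so far)
# simply appends the token.
EOS = 3  # hypothetical end-of-sequence token id

def transition(state, action):
    """P(s'|s,a) is a point mass: s' = s + (a,)."""
    return state + (action,)

def is_terminal(state):
    """An episode ends when EOS is generated."""
    return len(state) > 0 and state[-1] == EOS

s = ("<prompt>",)       # state: the prompt plus tokens generated so far
s = transition(s, 1)    # generate token 1
s = transition(s, EOS)  # generate EOS -> terminal state
assert is_terminal(s)
```

Because the environment dynamics are trivial, all of the learning difficulty in language-model RL sits in the reward and the policy, not in modeling transitions.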
Policy Gradient Theorem
The policy gradient theorem (Sutton & Barto, 2018) provides an unbiased gradient estimate for the expected return:
∇_θ J(θ) = E_π[G_t · ∇_θ log π_θ(a_t|s_t)]
Where G_t = Σ_{k=t}^{T} γ^{k-t} R_k is the discounted return from time t. This is the REINFORCE algorithm (Williams, 1992). The key insight: we can estimate this expectation by sampling trajectories from the current policy and computing the log-probability gradient.
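As a sketch of the estimator, the following samples one-step episodes from a toy 3-action softmax policy and averages G · ∇_θ log π_θ(a); the per-action rewards are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Toy policy: one softmax over 3 actions, parameterized directly by logits.
theta = np.zeros(3)

def sample_episode(theta):
    """One-step episode: sample an action, receive its (hypothetical) reward."""
    p = softmax(theta)
    a = rng.choice(3, p=p)
    return a, [1.0, 0.0, -1.0][a]

def reinforce_grad(theta, n_samples=5000):
    """Monte Carlo estimate of E[G * grad_theta log pi_theta(a)]."""
    grad = np.zeros_like(theta)
    p = softmax(theta)
    for _ in range(n_samples):
        a, G = sample_episode(theta)
        grad_logp = -p.copy()   # grad of log-softmax: one-hot(a) - p
        grad_logp[a] += 1.0
        grad += G * grad_logp
    return grad / n_samples

g = reinforce_grad(theta)
# The estimated gradient pushes probability toward action 0 (reward +1)
# and away from action 2 (reward -1).
assert g[0] > 0 and g[2] < 0
```

Ascending this gradient increases the logit of high-reward actions; the variance issue the next section addresses is visible here in that thousands of samples are needed for a stable estimate.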
Advantage Estimation
Raw return G_t has high variance. The advantage function subtracts a state-value baseline:
A(s_t, a_t) = G_t - V(s_t)
Where V(s_t) = E_π[G_t | s_t] is the expected return from state s_t (estimated by a learned value network). This reduces variance without introducing bias. In practice, Generalized Advantage Estimation (GAE, Schulman et al.) is used:
A_t^{GAE} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = R_t + γV(s_{t+1}) − V(s_t)
With λ ∈ [0, 1] trading off bias against variance: λ=1 recovers the Monte Carlo advantage G_t − V(s_t), while λ=0 gives the one-step TD residual.
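A minimal GAE implementation following the δ recursion above (the reward layout and value estimates are illustrative; with γ=λ=1 the result reduces to the Monte Carlo advantage G_t − V(s_t)):

```python
# Generalized Advantage Estimation via a backward recursion over TD residuals.
# rewards[t] = R_t; values[t] = V(s_t), with one extra entry for the
# terminal bootstrap V(s_T) (0 for an episodic task).
def gae(rewards, values, gamma=1.0, lam=0.95):
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Terminal-reward episode, as in RLHF: reward only on the last token.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.7, 0.0]  # V(s_0..s_2) plus terminal bootstrap 0
adv = gae(rewards, values, gamma=1.0, lam=1.0)
# With gamma = lam = 1 this equals G_t - V(s_t) = 1 - V(s_t) at every step.
```

Lowering λ below 1 shrinks the influence of distant, noisy rewards at the cost of bias from the imperfect value function.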
PPO: Proximal Policy Optimization
PPO (Schulman et al., 2017) constrains policy updates to prevent destructive large steps. The clipped surrogate objective:
L^{CLIP}(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]
Where r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the probability ratio between new and old policy.
| r_t(θ) value | A_t > 0 (good action) | A_t < 0 (bad action) |
|---|---|---|
| r_t = 1.0 | Gradient applies | Gradient applies |
| r_t = 1.3 (ε=0.2) | Clipped at 1.2 (no further gradient) | Not clipped (gradient applies) |
| r_t = 0.7 (ε=0.2) | Not clipped (gradient applies) | Clipped at 0.8 (no further gradient) |
The clip prevents the policy from deviating too far from the old policy in a single update step.
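The clipping behavior can be checked numerically; the following is a minimal sketch of the clipped surrogate written as a loss to minimize (inputs are illustrative, not a full PPO training step):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize)."""
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # min(...) keeps the pessimistic (lower) bound on the objective.
    return -np.mean(np.minimum(unclipped, clipped))

# Two samples with ratio 1.3: A > 0 is clipped to 1.2*A (gradient zeroed),
# A < 0 is left unclipped (the policy is still penalized for the bad action).
adv = np.array([1.0, -1.0])
logp_old = np.zeros(2)
logp_new = np.log(np.array([1.3, 1.3]))
loss = ppo_clip_loss(logp_new, logp_old, adv)
# surrogate terms: min(1.3, 1.2) = 1.2 and min(-1.3, -1.2) = -1.3
```

Note the asymmetry: clipping removes the *incentive* to push the ratio further out of [1−ε, 1+ε], but never blocks gradients that pull the ratio back toward 1.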
RLHF-Specific Reward Formulation
In RLHF, the reward combines the preference score with a KL penalty:
R_total(x, y) = r_φ(x, y) − β · KL[π_θ(·|x) || π_ref(·|x)]
Where:
- r_φ(x, y) is the learned reward model score
- β controls the strength of the KL penalty
- π_ref is the supervised fine-tuned reference policy
- KL divergence penalizes deviation from the reference policy
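A sketch of this combined reward, assuming token-level log-probabilities from the policy and reference model are available (the numbers below are illustrative). The KL term is approximated per token by log π_θ − log π_ref and summed over the response, as is common in RLHF implementations:

```python
import math

def rlhf_reward(reward_model_score, logp_policy, logp_ref, beta=0.1):
    """R_total = r_phi(x, y) - beta * KL-estimate over the response tokens."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward_model_score - beta * kl_estimate

# The policy assigns each token twice the reference probability, so the
# KL estimate is positive and the reward-model score gets penalized.
logp_policy = [math.log(0.5), math.log(0.4)]
logp_ref = [math.log(0.25), math.log(0.2)]
r = rlhf_reward(1.0, logp_policy, logp_ref, beta=0.1)
assert r < 1.0
```

Raising β shrinks the net reward for any drift from π_ref, which is exactly the trade-off the table below summarizes.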
| β value | Effect |
|---|---|
| β = 0 | No KL constraint; the policy is free to drift and reward-hack |
| β = 0.01 | Weak regularization; permits large policy changes |
| β = 0.1 | Common default; balances reward gain against policy stability |
| β = 1.0 | Strong constraint; the policy stays close to the reference, limiting reward gains |
Related Pages
See rlhf for how PPO is applied in the full RLHF pipeline, gradient-descent for the underlying optimization methods, and alignment-problem for why RL is needed for alignment rather than supervised methods alone.
Sources
- Schulman et al. (2017) — Proximal Policy Optimization Algorithms. arXiv
- Sutton & Barto (2018) — Reinforcement Learning: An Introduction. MIT Press
- Williams (1992) — Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning
Frequently Asked Questions
What is the Markov Decision Process formalism?
An MDP is a tuple (S, A, P, R, γ) where S is a state space, A an action space, P(s'|s,a) a transition function, R(s,a) a reward function, and γ ∈ [0,1] a discount factor. An agent observes state s, takes action a, receives reward r, transitions to s', and repeats. The goal is to find a policy π(a|s) that maximizes expected discounted return E[Σ γ^t R_t]. In language model RL: states are token sequences so far, actions are next tokens, reward is the preference score from the reward model.
Why does vanilla policy gradient have high variance and how does PPO fix it?
The REINFORCE gradient estimator ∇J(θ) = E[G_t · ∇log π(a|s)] is unbiased but has high variance because G_t (return) can be large and noisy. PPO addresses this with: (1) advantage estimation using a learned value function V(s) as a baseline (A = G_t - V(s_t)); (2) clipped surrogate objective that bounds the policy update ratio r(θ) = π_θ(a|s)/π_old(a|s) to [1-ε, 1+ε]; (3) multiple gradient steps per rollout batch with early stopping. The clipping prevents catastrophically large policy updates that destabilize training.
What is the credit assignment problem in RL for language models?
In RLHF, a reward signal (human preference score) is given for an entire generated response (often 50–500 tokens). The credit assignment problem asks: which tokens in the response caused the high/low reward? With γ=1 and terminal reward, all tokens in the sequence receive the same return, making it difficult to identify which specific word choices were good or bad. This is why RLHF training is less sample-efficient than supervised learning — the reward signal is sparse and temporally delayed.