KLip-PPO: A per-sample KL perspective on PPO-Clip

* equal contribution

PPO-Clip is a KL penalty whose coefficient varies per sample.

The clipped surrogate and a KL penalty are treated as separate algorithms. We show the clipped gradient is reproduced exactly by a KL penalty whose coefficient varies per sample, so the two produce indistinguishable training curves:

[Figure: CartPole-v1, real logged returns.]

Proximal Policy Optimization (Schulman et al., 2017) is the default algorithm for on-policy reinforcement learning. Each update maximises an importance-weighted return while keeping the new policy \(\pi_{\theta'}\) close to the policy \(\pi_\theta\) that collected the data. Each sampled action carries an importance ratio \(w_t=\pi_{\theta'}(a_t\mid s_t)/\pi_\theta(a_t\mid s_t)\), measuring how much the policy changed on it, and an advantage \(\hat A_t\) estimating how much better the action was than expected. PPO keeps \(w_t\) close to one in one of two ways.
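As a minimal sketch of the quantities just defined, the ratio \(w_t\) can be computed from the log-probabilities of the sampled action under the two policies. The function name and the numbers below are hypothetical, chosen only for illustration.

```python
import math

def importance_ratio(logp_new: float, logp_old: float) -> float:
    """w_t = pi_theta'(a_t|s_t) / pi_theta(a_t|s_t), computed in log space."""
    return math.exp(logp_new - logp_old)

# Hypothetical sample: the action had probability 0.20 under the data-collecting
# policy and has probability 0.25 under the updated policy.
logp_old = math.log(0.20)
logp_new = math.log(0.25)
w = importance_ratio(logp_new, logp_old)   # ratio above one: the action became more likely

advantage = 1.5                            # hypothetical advantage estimate
surrogate_term = w * advantage             # importance-weighted term w_t * A_t
```

Working with log-probabilities rather than probabilities is the usual choice because policy networks typically expose log-probabilities directly.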

Public run histories, configs, logs, metrics, and checkpoints are available as reproducibility artifacts in the KLip-PPO W&B project.

// 01  the clip and the KL penalty

PPO-Clip caps the ratio: once \(w_t\) leaves the band \([1-\epsilon,\,1+\epsilon]\) in the advantage-improving direction, the objective stops rewarding that sample.

PPO-KL instead subtracts a penalty proportional to the Kullback–Leibler divergence between the two policies, paying for every unit of drift.

[Figure: the clipped surrogate rises, then flattens once the ratio leaves the band.] PPO-Clip: the surrogate rises with the ratio \(w_t\), then flattens once \(w_t\) leaves the band, so past the edge the gradient is clipped.
[Figure: two policy distributions separated by a Kullback–Leibler divergence.] PPO-KL: the penalty grows with the divergence \(D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi_{\theta'})\), pulling \(\pi_{\theta'}\) back toward \(\pi_\theta\).
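The two surrogates can be sketched side by side for a single sample. This is an illustration under stated assumptions, not the project's implementation: the band width `eps=0.2`, the scalar coefficient `beta=1.0`, the two-action distributions, and all function names are hypothetical.

```python
import math

def clip_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip surrogate: min(w*A, clip(w, 1-eps, 1+eps)*A)."""
    clipped_w = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * adv, clipped_w * adv)

def kl_divergence(p_old, p_new):
    """D_KL(p_old || p_new) for two discrete action distributions at one state."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def kl_penalised_term(w, adv, kl, beta=1.0):
    """PPO-KL style surrogate: reward term minus a scalar multiple of the divergence."""
    return w * adv - beta * kl

# Hypothetical two-action policies at one state; action 0 was sampled.
p_old = [0.5, 0.5]
p_new = [0.75, 0.25]
w = p_new[0] / p_old[0]        # ratio 1.5, outside the band [0.8, 1.2]
adv = 1.0                      # positive advantage

clip_value = clip_term(w, adv)                                    # capped at (1 + eps) * adv
kl_value = kl_penalised_term(w, adv, kl_divergence(p_old, p_new)) # reward minus drift cost
```

The clipped term stops growing once the ratio passes the band edge, while the penalised term keeps the full reward \(w_t\hat A_t\) and pays for the drift separately.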

Since 2017 these have been read as separate algorithms, with their own gradients and hyperparameters, and a large literature compares them task by task. As training loops they look unrelated:

[Figure: PPO-Clip training loop.]

They are not alternatives. The clip is itself a KL penalty, applied one sample at a time.

// 02  the per-sample identity

Fix one update. Whether the clip changes a sample's gradient depends only on its ratio \(w_t\) and advantage \(\hat A_t\), which split the batch into three disjoint sets:

  • in-band (\(\mathcal I_{\mathrm{in}}\), \(w_t\in[1-\epsilon,1+\epsilon]\)): the clip is inactive; the gradient is the ordinary policy gradient.
  • kill (\(\mathcal I_{\mathrm{kill}}\)): \(w_t\) is outside the band in the advantage-improving direction, so the clip zeroes the gradient and the sample stops contributing.
  • pass (\(\mathcal I_{\mathrm{pass}}\)): \(w_t\) is outside the band but the move is corrective, so the unclipped term stays active.
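The three-way split above can be written as a small classifier on \((w_t,\hat A_t)\). This is a sketch: the function name, the string labels, and the default `eps=0.2` are assumptions made for illustration. A sample with zero advantage outside the band is labelled "pass" here, which is harmless because its gradient is zero either way.

```python
def classify(w, adv, eps=0.2):
    """Return 'in', 'kill', or 'pass' for one sample, given ratio w and advantage adv."""
    if 1.0 - eps <= w <= 1.0 + eps:
        return "in"                      # clip inactive: ordinary policy gradient
    # Outside the band. The move is advantage-improving when the ratio has already
    # gone up on a good action or gone down on a bad one.
    improving = (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0)
    return "kill" if improving else "pass"

# Hypothetical batch of (ratio, advantage) pairs, one per region.
batch = [(1.1, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0), (0.5, 1.0)]
labels = [classify(w, a) for w, a in batch]
```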

So the clip's only effect is to zero the gradient on the kill set. A per-sample KL penalty with coefficient \(\beta_t=-w_t\hat A_t\) on the kill set (and \(0\) elsewhere), with \(\beta_t\) treated as a constant when differentiating because it depends on \(w_t\), zeroes exactly those gradients and leaves the rest untouched. Added to the unpenalised surrogate, it reproduces the PPO-Clip gradient sample for sample:

Theorem 1 · per-sample gradient identity
\[ \nabla_{\theta'} L_{\mathrm{CLIP}} \;=\; \nabla_{\theta'}\,\mathbb{E}_t\!\big[\, w_t\hat A_t \;+\; \beta_t \,\log \pi_{\theta'}(a_t\mid s_t) \,\big], \qquad \beta_t = -\,w_t\hat A_t \,\mathbb{1}\!\left[t\in\mathcal I_{\mathrm{kill}}\right]. \]
[Figure: the coefficient \(\beta_t\) over the plane of ratio \(w\) and advantage \(\hat A\).] Where the clip acts. The coefficient \(\beta_t=-w\hat A\) lives on the two kill corners (red); in the band and the pass corners \(\beta_t=0\).

The match is exact and global: it holds at every parameter setting \(\theta'\) and across the whole inner loop, whereas the original PPO paper argued only a first-order agreement near the data-collecting parameters \(\theta\).
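The identity can be checked numerically on single samples. The sketch below is not the paper's proof: it differentiates with respect to the log-probability \(\log\pi_{\theta'}(a_t\mid s_t)\) rather than \(\theta'\) (the chain rule carries the result to \(\theta'\)), it assumes \(\beta_t\) is held constant under differentiation, and it avoids the band edges, where the clipped surrogate is not differentiable. All names and numbers are hypothetical.

```python
import math

EPS = 0.2  # hypothetical band width

def clip_objective(logp_new, logp_old, adv, eps=EPS):
    """Per-sample clipped surrogate as a function of the new log-probability."""
    w = math.exp(logp_new - logp_old)
    clipped_w = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * adv, clipped_w * adv)

def clip_grad_fd(logp_new, logp_old, adv, h=1e-6):
    """Central finite difference of the clipped surrogate w.r.t. logp_new."""
    up = clip_objective(logp_new + h, logp_old, adv)
    down = clip_objective(logp_new - h, logp_old, adv)
    return (up - down) / (2.0 * h)

def kl_form_grad(logp_new, logp_old, adv, eps=EPS):
    """Analytic gradient of w*A + beta*logp_new w.r.t. logp_new, with beta held fixed."""
    w = math.exp(logp_new - logp_old)
    kill = (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0)
    beta = -w * adv if kill else 0.0
    return w * adv + beta        # d(w*A)/dlogp = w*A, d(beta*logp)/dlogp = beta

logp_old = math.log(0.3)
# (ratio, advantage) pairs: in-band, both kill corners, both pass corners.
cases = [(1.1, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0), (0.5, 1.0)]
max_gap = max(
    abs(clip_grad_fd(logp_old + math.log(w), logp_old, a)
        - kl_form_grad(logp_old + math.log(w), logp_old, a))
    for w, a in cases
)
```

A small `max_gap` on these points is consistent with the identity; it is a sanity check on a handful of samples, not a substitute for the theorem.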

// 03  learning curves

Because they share a gradient, the clip and the per-sample KL surrogate train the same way. The two panels below show the logged returns for PPO-Clip and per-sample KL (mean over five seeds, on a shared scale); the curves stay together on every task.

The scalar baselines and the full four-way comparison are on the experiments page.

// 04  the same gradient, three forms

The identity lets PPO-Clip be written three ways that compute the same gradient but place the penalty in different spaces. All three share the reward term \(w_t\hat A_t\) and differ only in how they express the penalty: hidden in a \(\min\), on the importance weight, or on the policy.

\[ L_{\mathrm{CLIP}} = \mathbb{E}_t\big[\min(\,w_t\hat A_t,\ \mathrm{clip}(w_t,1-\epsilon,1+\epsilon)\,\hat A_t\,)\big] \]

Schulman et al., 2017. The \(\min\) hides the penalty; only its effect is visible.

\[ L_{\mathrm{CLIP}} = \mathbb{E}_t\big[\,w_t\hat A_t - \Phi(w_t,\hat A_t)\,\big] \]

A dual form (Appendix A): the penalty \(\Phi\) subtracts how far \(w_t\) overshoots the band, on the kill samples.
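Appendix A is not reproduced here, so the exact form of \(\Phi\) used by the authors is not shown. One expression that satisfies the displayed equation, obtained by rearranging the \(\min\) with \(\min(a,b)=a-\max(0,a-b)\), is \(\Phi(w,\hat A)=\max\!\big(0,\,(w-\mathrm{clip}(w,1-\epsilon,1+\epsilon))\,\hat A\big)\). The sketch below checks that rearrangement on a grid; the function names and `eps=0.2` are assumptions.

```python
def clip_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip surrogate in its original min form."""
    clipped_w = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * adv, clipped_w * adv)

def phi(w, adv, eps=0.2):
    """Overshoot penalty: nonzero only where w is past the band in the improving direction."""
    clipped_w = min(max(w, 1.0 - eps), 1.0 + eps)
    return max(0.0, (w - clipped_w) * adv)

# Compare w*A - Phi against the min form on a grid of ratios and advantages.
ratios = [0.3 + 0.1 * i for i in range(18)]          # roughly 0.3 .. 2.0
advantages = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_gap = max(
    abs((w * a - phi(w, a)) - clip_term(w, a))
    for w in ratios for a in advantages
)
```

With this choice, \(\Phi\) is zero in the band and on the pass samples, and equals \(|\hat A_t|\) times the distance past the band edge on the kill samples.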

\[ L_{\mathrm{CLIP}} = \mathbb{E}_t\big[\,w_t\hat A_t + \beta_t\,\log\pi_{\theta'}(a_t\mid s_t)\,\big],\quad \beta_t=-w_t\hat A_t\,\mathbb{1}\!\left[t\in\mathcal I_{\mathrm{kill}}\right] \]

The per-sample KL form. The equality here is at the level of gradients, with \(\beta_t\) held constant, as in Theorem 1. PPO-Clip fixes \(\beta_t\) to a step on the boundary; the per-sample view makes that step an explicit choice.

// 05  consequences

The identity removes the opposition between the two surrogates. PPO-Clip is a KL penalty with a per-sample coefficient \(\beta_t\), so whatever edge clipping has shown over PPO-KL is the edge of a per-sample coefficient over a single scalar coefficient, which our experiments confirm on the high-dimensional tasks.

That reframes the open question as the shape of \(\beta_t\). The clip fixes it to a hard step on the trust-region boundary; soft relaxations and asymmetric or position-aware coefficients are other choices within the same surrogate family, now reachable as design knobs rather than separate algorithms.
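To make the idea of a design knob concrete, the sketch below compares the hard step with one possible soft relaxation, in which the kill indicator is replaced by a sigmoid of the signed overshoot. This relaxation is purely illustrative and is not taken from the paper; `soft_beta`, `sharpness`, and `eps=0.2` are hypothetical.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hard_beta(w, adv, eps=0.2):
    """The coefficient implied by PPO-Clip: a hard step on the kill set."""
    kill = (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0)
    return -w * adv if kill else 0.0

def soft_beta(w, adv, eps=0.2, sharpness=50.0):
    """Hypothetical relaxation: the step is smoothed by a sigmoid gate."""
    if adv > 0:
        overshoot = w - (1.0 + eps)      # positive once w is past the upper edge
    elif adv < 0:
        overshoot = (1.0 - eps) - w      # positive once w is past the lower edge
    else:
        return 0.0
    return -w * adv * _sigmoid(sharpness * overshoot)

# With a large sharpness the soft coefficient approaches the hard step.
gap_kill = abs(soft_beta(1.5, 1.0, sharpness=1000.0) - hard_beta(1.5, 1.0))
gap_band = abs(soft_beta(1.1, 1.0, sharpness=1000.0) - hard_beta(1.1, 1.0))
gap_pass = abs(soft_beta(1.5, -1.0, sharpness=1000.0) - hard_beta(1.5, -1.0))
```

Lowering `sharpness` spreads the transition over a wider range of ratios; whether that helps training is an empirical question this sketch does not address.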

References

  1. Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  2. Schulman et al. (2015). Trust Region Policy Optimization. ICML. arXiv:1502.05477.
  3. Engstrom et al. (2020). Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. ICLR. arXiv:2005.12729.
  4. Ilyas et al. (2020). A Closer Look at Deep Policy Gradients. ICLR. arXiv:1811.02553.
  5. Andrychowicz et al. (2021). What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. ICLR. arXiv:2006.05990.
  6. Hsu et al. (2020). Revisiting Design Choices in Proximal Policy Optimization. arXiv:2009.10897.
  7. Sun et al. (2022). You May Not Need Ratio Clipping in PPO. arXiv:2202.00079.