
Spinning Up by OpenAI


Some notes from OpenAI Spinning Up.

Kinds of RL Algorithms

See A Taxonomy of RL Algorithms in Spinning Up.

Key Algorithms

Vanilla Policy Gradient

The key idea underlying policy gradient is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.

Quick Facts

  • on-policy

  • discrete or continuous action spaces

Key Equations

Let $\pi_\theta$ denote a policy with parameters $\theta$, and $J(\pi_\theta)$ denote the expected finite-horizon undiscounted return of the policy. The gradient of $J(\pi_\theta)$ is

$$\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, A^{\pi_{\theta}}(s_t,a_t) \right],$$

where $\tau$ is a trajectory and $A^{\pi_\theta}$ is the advantage function for the current policy.

The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:

$$\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k}).$$

Exploration vs. Exploitation

VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.

Pseudocode
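A minimal PyTorch-style sketch of the update implied by the key equations above; this is not the official Spinning Up pseudocode or implementation, and the network architecture, the random batch of states and actions, and the advantage values are illustrative stand-ins.

```python
# Hedged sketch of one vanilla policy gradient update; the data here is a
# random stand-in, in practice it comes from trajectories sampled with the policy.
import torch
import torch.nn as nn

obs_dim, n_actions, batch = 4, 2, 64
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(batch, obs_dim)             # s_t from sampled trajectories
actions = torch.randint(0, n_actions, (batch,))  # a_t taken by the current policy
advantages = torch.randn(batch)                  # estimates of A^{pi_theta}(s_t, a_t)

# The gradient of this loss is -E[ grad log pi_theta(a|s) * A ], so a gradient
# *descent* step on it is a stochastic gradient *ascent* step on J(pi_theta).
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * advantages).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the batch comes from trajectories collected with the current policy, and the advantages are computed from a learned value function (e.g. with Generalized Advantage Estimation) before the update.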

References

    • Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al. 2000: a timeless classic of RL theory; contains references to the earlier work which led to modern policy gradients.

    • Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs, Schulman 2016(a): chapter 2 contains a lucid introduction to the theory of policy gradient algorithms, including pseudocode.

    • Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al. 2016: a recent benchmark paper that shows how vanilla policy gradient in the deep RL setting (e.g. with neural network policies and Adam as the optimizer) compares with other deep RL algorithms.

    • High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016(b): the VPG implementation makes use of Generalized Advantage Estimation for computing the policy gradient.

Trust Region Policy Optimization

TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-Divergence, a measure of distance between probability distributions.

This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can have very large differences in performance, so a single bad step can collapse the policy performance. This makes it dangerous to use large step sizes with vanilla policy gradient, thus hurting its sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.

Quick Facts

  • on-policy

  • discrete or continuous action spaces

Key Equations

Let $\pi_\theta$ denote a policy with parameters $\theta$. The theoretical TRPO update is:

$$\theta_{k+1} = \arg\max_{\theta} \; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta || \theta_k) \le \delta,$$

where $\mathcal{L}(\theta_k, \theta)$ is the surrogate advantage, a measure of how policy $\pi_\theta$ performs relative to the old policy $\pi_{\theta_k}$ using data from the old policy:

$$\mathcal{L}(\theta_k, \theta) = E_{s,a \sim \pi_{\theta_k}}\left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}\, A^{\pi_{\theta_k}}(s,a) \right],$$

and $\bar{D}_{KL}(\theta || \theta_k)$ is an average KL-divergence between policies across states visited by the old policy:

$$\bar{D}_{KL}(\theta || \theta_k) = E_{s \sim \pi_{\theta_k}}\left[ D_{KL}\left(\pi_{\theta}(\cdot|s) \,||\, \pi_{\theta_k}(\cdot|s) \right)\right].$$

Notice that the objective and constraint are both zero when $\theta = \theta_k$. Furthermore, the gradient of the constraint with respect to $\theta$ is zero when $\theta = \theta_k$.

The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around $\theta_k$:

$$\begin{aligned} \mathcal{L}(\theta_k, \theta) &\approx g^T (\theta - \theta_k), \\ \bar{D}_{KL}(\theta || \theta_k) &\approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k), \end{aligned}$$

resulting in an approximate optimization problem,

$$\theta_{k+1} = \arg\max_{\theta} \; g^T (\theta - \theta_k) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \le \delta.$$

Notice that the gradient $g$ of the surrogate advantage function with respect to $\theta$, evaluated at $\theta = \theta_k$, is exactly equal to the policy gradient, $\nabla_\theta J(\pi_\theta)$.

This approximate problem can be analytically solved by the methods of Lagrangian duality, yielding the solution:

$$\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}}\, H^{-1} g.$$

If we were to stop here and just use this final result, the algorithm would be exactly calculating the natural policy gradient. A problem is that, due to the approximation errors introduced by the Taylor expansion, this may not satisfy the KL constraint, or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,

$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^T H^{-1} g}}\, H^{-1} g,$$

where $\alpha \in (0,1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer such that $\pi_{\theta_{k+1}}$ satisfies the KL constraint and produces a positive surrogate advantage.

Lastly: computing and storing the matrix inverse, $H^{-1}$, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve $Hx = g$ for $x = H^{-1} g$, requiring only a function that can compute the matrix-vector product $Hx$ instead of computing and storing the whole matrix $H$ directly. This is not too hard to do: we set up a symbolic operation to calculate

$$Hx = \nabla_{\theta} \left( \left(\nabla_{\theta} \bar{D}_{KL}(\theta || \theta_k)\right)^T x \right),$$

which gives us the correct output without computing the whole matrix.
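A minimal NumPy sketch of the conjugate gradient solve described above; the small explicit matrix H, the vector g, and delta are toy stand-ins, since in a real implementation the product Hx comes from automatic differentiation of the KL divergence rather than from a stored matrix.

```python
# Hedged sketch: solve H x = g with conjugate gradient, using only
# matrix-vector products H v (here a tiny explicit H stands in for the
# autodiff Hessian-vector product used in practice).
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - H x, with x = 0 initially
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

H = np.array([[2.0, 0.3], [0.3, 1.0]])   # stand-in for the KL Hessian (symmetric positive definite)
g = np.array([1.0, -0.5])                # stand-in for the policy gradient
delta = 0.01                             # KL step-size limit

x = conjugate_gradient(lambda v: H @ v, g)    # x approximates H^{-1} g
full_step = np.sqrt(2 * delta / (g @ x)) * x  # natural gradient step, before the line search
```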

Exploration vs. Exploitation

TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.

Pseudocode
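Not the official Spinning Up pseudocode; below is a hedged sketch of only the backtracking line search step described above, with toy quadratic stand-ins for the surrogate advantage and averaged KL estimators (the real algorithm evaluates both on data sampled from the old policy).

```python
# Hedged sketch of TRPO's backtracking line search: shrink the proposed
# step by alpha^j until the KL constraint holds and the surrogate improves.
import numpy as np

delta, alpha, max_backtracks = 0.01, 0.8, 10
theta_old = np.zeros(2)
full_step = np.array([0.05, -0.02])      # sqrt(2 delta / g^T H^-1 g) * H^-1 g from the CG step

def surrogate_advantage(theta):          # toy stand-in for L(theta_k, theta)
    d = theta - theta_old
    return 0.1 * d.sum() - 0.5 * (d @ d)

def mean_kl(theta):                      # toy stand-in for the averaged KL divergence
    d = theta - theta_old
    return 0.5 * (d @ d)

theta_new = theta_old                    # fall back to no update if every candidate fails
for j in range(max_backtracks):
    candidate = theta_old + alpha ** j * full_step
    if mean_kl(candidate) <= delta and surrogate_advantage(candidate) > 0:
        theta_new = candidate            # accept the first (largest) step passing both checks
        break
```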

References

    • Trust Region Policy Optimization, Schulman et al. 2015: the original paper describing TRPO.

    • High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016: Generalized Advantage Estimation.

    • Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford 2002: contains theoretical results which motivate and deeply connect to the theoretical foundations of TRPO.

Proximal Policy Optimization

Quick Facts

Key Equations

Exploration vs. Exploitation

Pseudocode

References

Deep Deterministic Policy Gradient

Quick Facts

Key Equations

Exploration vs. Exploitation

Pseudocode

References

Twin Delayed DDPG

Quick Facts

Key Equations

Exploration vs. Exploitation

Pseudocode

References

Soft Actor-Critic

Quick Facts

Key Equations

Exploration vs. Exploitation

Pseudocode

References
