Spinning Up by OpenAI
Notes from OpenAI Spinning Up
Kinds of RL Algorithms
Key Algorithms
Vanilla Policy Gradient
The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
Quick Facts
on-policy
discrete or continuous action spaces
Key Equations
The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:
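Concretely, writing $A^{\pi_\theta}$ for the advantage function, the update is

$$\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\pi_\theta) \big|_{\theta_k},$$

where the policy gradient is estimated from sampled trajectories as

$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, A^{\pi_\theta}(s_t, a_t) \right].$$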
Exploration vs. Exploitation
VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
Pseudocode
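The full Spinning Up pseudocode is not reproduced here. Below is a minimal, illustrative PyTorch sketch of the core update for a discrete action space; the network sizes, learning rate, and batch contents are placeholders, not the Spinning Up implementation. Note that actions are sampled from the current stochastic policy, which is also how VPG explores.

```python
import torch
from torch.distributions import Categorical

# Illustrative policy network for a discrete action space (sizes are placeholders).
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def sample_action(obs):
    """Explore by sampling from the current stochastic policy."""
    return Categorical(logits=policy(obs)).sample()

def vpg_update(obs, actions, advantages):
    """One policy gradient step: gradient ascent on E[log pi(a|s) * A(s,a)]."""
    logp = Categorical(logits=policy(obs)).log_prob(actions)
    loss = -(logp * advantages).mean()  # negated because optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Toy usage with random tensors standing in for collected trajectories.
obs = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
advantages = torch.randn(8)
vpg_update(obs, actions, advantages)
```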
References
Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al. 2000
timeless classic of RL theory
contains references to the earlier work which led to modern policy gradients.
Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs, Schulman 2016(a)
chapter 2 contains a lucid introduction to the theory of policy gradient algorithms, including pseudocode.
Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al. 2016
recent benchmark paper that shows how vanilla policy gradient in the deep RL setting (e.g. with neural network policies and Adam as the optimizer) compares with other deep RL algorithms.
High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016(b)
the implementation of VPG makes use of Generalized Advantage Estimation for computing the policy gradient.
Trust Region Policy Optimization
TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-Divergence, a measure of distance between probability distributions.
This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradients, thus hurting its sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.
Quick Facts
on-policy
discrete or continuous action spaces
Key Equations
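In the notation used by the Spinning Up docs, the theoretical TRPO update is

$$\theta_{k+1} = \arg\max_{\theta} \; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta \,\|\, \theta_k) \le \delta,$$

where the surrogate advantage

$$\mathcal{L}(\theta_k, \theta) = \mathop{\mathbb{E}}_{s,a \sim \pi_{\theta_k}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} \, A^{\pi_{\theta_k}}(s,a) \right]$$

measures how much better the new policy performs relative to the old one using data from the old policy, and $\bar{D}_{KL}$ is the average KL-divergence between the two policies over states visited by the old policy. TRPO approximates the objective by a first-order Taylor expansion and the constraint by a second-order expansion around $\theta_k$,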
resulting in an approximate optimization problem,
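$$\theta_{k+1} = \arg\max_{\theta} \; g^T (\theta - \theta_k) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \le \delta,$$

where $g$ is the gradient of the surrogate advantage and $H$ is the Hessian of the average KL-divergence, both evaluated at $\theta_k$.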
This approximate problem can be analytically solved by the methods of Lagrangian duality, yielding the solution:
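$$\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g.$$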
If we were to stop here, and just use this final result, the algorithm would be exactly calculating the Natural Policy Gradient. A problem is that, due to the approximation errors introduced by the Taylor expansion, this may not satisfy the KL constraint, or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,
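$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g,$$

where $\alpha \in (0, 1)$ is the backtracking coefficient and $j$ is the smallest nonnegative integer for which the new policy satisfies the KL constraint and produces a positive surrogate advantage.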
In practice, computing and storing $H^{-1}$ directly is prohibitively expensive for neural network policies with many parameters, so TRPO uses the conjugate gradient algorithm to solve $Hx = g$ for $x \approx H^{-1} g$. This only requires a function that computes the Hessian-vector product $Hx = \nabla_\theta \left( (\nabla_\theta \bar{D}_{KL}(\theta \,\|\, \theta_k))^T x \right)$, which gives us the correct output without computing the whole matrix.
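A minimal PyTorch sketch of that trick (not the Spinning Up implementation; function names, the damping term, and iteration counts are illustrative): the Hessian-vector product is obtained by double backprop through the KL-divergence, and conjugate gradient uses only those products.

```python
import torch

def hessian_vector_product(kl, params, v, damping=1e-2):
    """Compute (H + damping*I) v, where H is the Hessian of the scalar `kl`
    w.r.t. `params`, via double backprop; H itself is never formed.
    `v` is a flat vector with one entry per parameter."""
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = torch.dot(flat_grad, v)
    hv = torch.autograd.grad(grad_v, params, retain_graph=True)
    flat_hv = torch.cat([h.reshape(-1) for h in hv])
    return flat_hv + damping * v

def conjugate_gradient(hvp_fn, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only matrix-vector products hvp_fn(v)."""
    x = torch.zeros_like(g)
    r = g.clone()
    p = g.clone()
    rs_old = torch.dot(r, r)
    for _ in range(iters):
        Hp = hvp_fn(p)
        alpha = rs_old / torch.dot(p, Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = torch.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check: for kl = sum(theta^2), the Hessian is 2*I, so CG should return g / 2.
theta = torch.randn(5, requires_grad=True)
kl = (theta ** 2).sum()
g = torch.randn(5)
x = conjugate_gradient(lambda v: hessian_vector_product(kl, [theta], v, damping=0.0), g)
print(torch.allclose(x, g / 2, atol=1e-4))
```

In an actual TRPO step, `kl` would be the average KL-divergence between the new and old policies computed from the collected batch, `params` the policy parameters, and `g` the flattened gradient of the surrogate advantage.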
Exploration vs. Exploitation
TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
Pseudocode
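The full pseudocode is not reproduced here; a condensed outline of one iteration (paraphrasing the Spinning Up presentation, with value-function details omitted):
1. Collect a batch of trajectories by running the current policy in the environment.
2. Compute rewards-to-go and advantage estimates (e.g. with GAE) using the current value function.
3. Estimate the policy gradient g of the surrogate advantage.
4. Use conjugate gradient to compute x ≈ H⁻¹g, where H is the Hessian of the average KL-divergence.
5. Update the policy by backtracking line search along x, accepting the first step that satisfies the KL constraint and improves the surrogate advantage.
6. Fit the value function by regression on the rewards-to-go.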
References
Trust Region Policy Optimization, Schulman et al. 2015
original paper describing TRPO
High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016
Generalized Advantage Estimation
Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford 2002
contains theoretical results which motivate and deeply connect to the theoretical foundations of TRPO.
Proximal Policy Optimization
Quick Facts
Key Equations
Exploration vs. Exploitation
Pseudocode
References
Deep Deterministic Policy Gradient
Quick Facts
Key Equations
Exploration vs. Exploitation
Pseudocode
References
Twin Delayed DDPG
Quick Facts
Key Equations
Exploration vs. Exploitation
Pseudocode
References
Soft Actor-Critic
Quick Facts
Key Equations
Exploration vs. Exploitation
Pseudocode
References