But why so many RL algorithms?

Different tradeoffs

Sample efficiency

  • How many samples do we need to get a good policy?
  • Off-policy or on-policy?
    • Off-policy: able to improve the policy without generating new samples from that policy.
    • On-policy: each time the policy changes, even a little bit, we need to generate new samples.
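The distinction can be sketched with a toy example (the one-state bandit, its mean rewards, and the learning rates below are all made up for illustration):

```python
import random

random.seed(0)

# Toy 1-state, 2-action problem; the mean rewards below are made up.
TRUE_MEAN = {0: 0.2, 1: 0.8}

def step(action):
    return TRUE_MEAN[action] + random.gauss(0.0, 0.1)

# Off-policy flavor: collect a replay buffer ONCE with a fixed behavior
# policy, then run many value updates without touching the environment again.
buffer = [(a, step(a)) for _ in range(50) for a in (0, 1)]  # 100 env samples total
q = {0: 0.0, 1: 0.0}
for _ in range(20):                  # 20 rounds of improvement...
    for a, r in buffer:              # ...all reusing the same old samples
        q[a] += 0.05 * (r - q[a])    # running-average style update
best = max(q, key=q.get)

# On-policy flavor: each of those 20 rounds would instead require a fresh
# batch sampled from the *current* policy -- far more environment interaction.
```

The off-policy learner touches the environment 100 times regardless of how many improvement rounds it runs; an on-policy learner pays for fresh samples at every round.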
But why would we ever use a less sample-efficient algorithm?
Because sample efficiency is not the only measure of an RL algorithm; a less sample-efficient algorithm can still be faster to run -- wall-clock time is not the same as sample efficiency.

Stability & ease of use

Does it converge? Converge to what? Does it converge every time?
Supervised learning is almost always gradient descent, but RL often is not. For example, Q-learning is fixed-point iteration, not gradient descent on any objective. More specifically:
  • Policy gradient
    • The only one that actually performs gradient ascent on the true objective (the expected reward), but often also the least sample-efficient.
  • Value function fitting
    • At best, minimizes error of fit ("Bellman error", not the same as expected reward)
    • At worst, doesn't optimize anything, not guaranteed to converge to anything in the nonlinear case.
  • Model-based RL
    • The model minimizes its error of fit; this is supervised learning, so it converges.
    • But there is no guarantee that a better model yields a better policy.
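The fixed-point-iteration point can be made concrete with tabular value iteration on a tiny made-up MDP (all transition probabilities and rewards below are arbitrary): the update repeatedly applies the Bellman optimality operator, and in the tabular case this is a contraction, so the Bellman error goes to zero.

```python
import numpy as np

# Tiny 2-state, 2-action MDP with made-up dynamics and rewards.
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] transition probs
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a] rewards

Q = np.zeros((n_s, n_a))
for _ in range(200):
    # Bellman optimality operator: Q <- R + gamma * E_{s'}[max_a' Q(s', a')]
    Q = R + gamma * P @ Q.max(axis=1)

# At a fixed point, applying the operator changes nothing.
bellman_error = np.abs(Q - (R + gamma * P @ Q.max(axis=1))).max()
print(bellman_error)  # ~0: Q is a fixed point of the Bellman operator
```

With function approximation instead of a table, the contraction argument breaks down, which is exactly why fitted value methods lose their convergence guarantee.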

Different assumptions

Stochastic or deterministic? Continuous or discrete? Episodic or infinite horizon?
  1. Full observability
    • Generally assumed by value function fitting methods
    • Can be mitigated by adding recurrence
  2. Episodic learning
    • Often assumed by pure policy gradient methods
    • Assumed by some model-based RL methods
  3. Continuity or smoothness
    • Assumed by some continuous value function learning methods
    • Often assumed by some model-based RL methods
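A minimal sketch of why recurrence mitigates partial observability (the observation sequences and the decayed-sum "hidden state" below are hypothetical stand-ins for a real RNN): two histories can end in the same current observation, so a memoryless value function cannot tell them apart, while a recurrent summary of the history can.

```python
# Two made-up observation sequences that end with the SAME observation.
traj_a = [0, 1, 0]
traj_b = [1, 1, 0]

def recurrent_summary(obs_seq, decay=0.5):
    # h <- decay*h + obs: an RNN-like hidden-state update over the history.
    h = 0.0
    for o in obs_seq:
        h = decay * h + o
    return h

# Memoryless view: identical last observations, so identical "state".
assert traj_a[-1] == traj_b[-1]
# Recurrent view: the histories produce different hidden states.
assert recurrent_summary(traj_a) != recurrent_summary(traj_b)
```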

Different things are easy or hard in different settings

  • In some problems it is easier to represent the policy
  • In others it is easier to represent the model