Comparison

But why are there so many RL algorithms?

Different tradeoffs

Sample efficiency

  • How many samples do we need to get a good policy?

  • Off-policy or on-policy? (A minimal sketch of the difference follows this list.)

    • Off-policy: able to improve the policy without generating new samples from that policy.

    • On-policy: each time the policy changes, even a little bit, we need to generate new samples.
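
To make the data-reuse difference concrete, here is a minimal Python sketch, not any particular algorithm; collect_rollout and update_policy are hypothetical placeholders standing in for environment interaction and whatever update rule the method uses.

    from collections import deque
    import random

    # Hypothetical stand-ins so the sketch runs on its own; a real agent would
    # interact with an environment and apply an actual update rule here.
    def collect_rollout(env, policy):
        return []  # would return a list of (s, a, r, s') transitions

    def update_policy(policy, batch):
        pass

    def train_on_policy(env, policy, num_iters, rollouts_per_iter=10):
        # On-policy: every update consumes fresh rollouts from the current policy;
        # once the policy changes, those rollouts are stale and are thrown away.
        for _ in range(num_iters):
            batch = [collect_rollout(env, policy) for _ in range(rollouts_per_iter)]
            update_policy(policy, batch)

    def train_off_policy(env, policy, num_iters, buffer_size=100_000, batch_size=256):
        # Off-policy: transitions gathered under older policies stay in a replay
        # buffer and keep being reused, which is where the sample efficiency comes from.
        buffer = deque(maxlen=buffer_size)
        for _ in range(num_iters):
            buffer.extend(collect_rollout(env, policy))
            batch = random.sample(list(buffer), k=min(batch_size, len(buffer)))
            update_policy(policy, batch)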

But why would we use a less efficient algorithm?

Because sample efficiency is not the only measure of an RL algorithm. A less sample-efficient method can still be quicker to train, for example when simulation is cheap and easy to parallelize: wall-clock time is not the same as sample efficiency.

Stability & ease of use

Does it converge? If so, to what? And does it converge every time?

Supervised learning is almost always gradient descent, but RL is often not gradient descent. For example, Q-learning is a fixed-point iteration (a minimal sketch follows). As for the main algorithm families:
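
A hedged tabular sketch of that fixed-point view, assuming the transition probabilities P[s, a, s'] and rewards R[s, a] are given as NumPy arrays:

    import numpy as np

    def q_value_iteration(P, R, gamma=0.99, iters=1000):
        # Repeatedly apply the Bellman backup. The optimal Q-function is the
        # fixed point of this operator, so this is fixed-point iteration,
        # not gradient descent on any loss.
        num_states, num_actions, _ = P.shape
        Q = np.zeros((num_states, num_actions))
        for _ in range(iters):
            V = Q.max(axis=1)          # V(s') = max_a' Q(s', a')
            Q = R + gamma * P @ V      # Q(s, a) <- r(s, a) + gamma * E[V(s')]
        return Q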

  • Policy gradient

    • The only one that actually performs gradient descent (ascent) on the true objective, but also often the least sample-efficient.

  • Value function fitting

    • At best, minimizes the error of fit (the "Bellman error", which is not the same as the expected reward; see the sketch after this list).

    • At worst, doesn't optimize anything, and is not guaranteed to converge to anything in the nonlinear (function approximation) case.

  • Model-based RL

    • The model minimizes its error of fit, which will converge.

    • But there is no guarantee that a better model yields a better policy.
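
To make the two objectives concrete, here is a rough PyTorch-style sketch; the tensor names (logp, returns, q_values, and so on) are assumptions for illustration, not a reference implementation.

    import torch

    def policy_gradient_loss(logp, returns):
        # logp: log pi(a_t | s_t) for the sampled actions; returns: reward-to-go.
        # Minimizing this surrogate is gradient ascent on (an estimate of)
        # the true objective, the expected return.
        return -(logp * returns).mean()

    def bellman_error(q_values, rewards, next_q_values, dones, gamma=0.99):
        # Squared Bellman error for Q-function fitting. This measures how
        # self-consistent Q is, not how much reward the induced policy collects.
        targets = rewards + gamma * (1.0 - dones) * next_q_values.detach()
        return ((q_values - targets) ** 2).mean()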

Different assumptions

Stochastic or deterministic? Continuous or discrete? Episodic or infinite horizon?

  1. Full observability

    • Generally assumed by value function fitting methods

    • Can be mitigated by adding recurrence (see the sketch after this list)

  2. Episodic learning

    • Often assumed by pure policy gradient methods

    • Assumed by some model-based RL methods

  3. Continuity or smoothness

    • Assumed by some continuous value function learning methods

    • Often assumed by some model-based RL methods
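
On the recurrence point in item 1: a minimal PyTorch sketch of a recurrent policy that conditions on the observation history rather than a single observation; the architecture and sizes are arbitrary illustrative choices.

    import torch.nn as nn

    class RecurrentPolicy(nn.Module):
        # The GRU's hidden state summarizes past observations, which partially
        # compensates for missing state information under partial observability.
        def __init__(self, obs_dim, act_dim, hidden_dim=128):
            super().__init__()
            self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, act_dim)

        def forward(self, obs_seq, hidden=None):
            # obs_seq: (batch, time, obs_dim) -> action logits per timestep
            out, hidden = self.rnn(obs_seq, hidden)
            return self.head(out), hidden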

Different things are easy or hard in different settings

  • Sometimes it is easier to represent the policy.

  • Sometimes it is easier to represent the model.

(Figure: the sample efficiency spectrum, comparing off-policy and on-policy algorithms)