Comparison
But why so many RL algorithms?
Different tradeoffs
Sample efficiency

- How many samples do we need to get a good policy? 
- Off-policy or on-policy? - Off-policy: able to improve the policy without generating new samples from that policy (see the sketch below). 
- On-policy: each time the policy changes, even a little bit, we need to generate new samples. 
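To make this distinction concrete, here is a minimal Python sketch of the two training loops. `collect_rollout`, `policy_gradient_update`, and `q_update` are hypothetical placeholders, not functions from any particular library; the point is only where the data comes from and whether it can be reused.

```python
from collections import deque
import random

def train_on_policy(policy, num_iters):
    for _ in range(num_iters):
        # Every update needs fresh samples from the *current* policy;
        # old samples become stale as soon as the policy changes.
        rollouts = [collect_rollout(policy) for _ in range(10)]
        policy = policy_gradient_update(policy, rollouts)
    return policy

def train_off_policy(q_function, behavior_policy, num_iters):
    replay_buffer = deque(maxlen=100_000)
    for _ in range(num_iters):
        # Transitions from any (older) policy can be reused, so far fewer
        # environment interactions are needed per update.
        replay_buffer.extend(collect_rollout(behavior_policy))
        batch = random.sample(replay_buffer, k=min(256, len(replay_buffer)))
        q_function = q_update(q_function, batch)
    return q_function
```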
 

But why would we use a less sample-efficient algorithm?
Because sample efficiency is not the only way to measure an RL algorithm, and a less sample-efficient algorithm may still be quicker to train: wall-clock time is not the same as sample efficiency.
Stability & ease of use
Converge? Converge to what? Converge every time?
Supervised learning is almost always gradient descent, but RL often is not gradient descent. For example, Q-learning is fixed-point iteration.
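To make the fixed-point view concrete, here is a small runnable sketch of tabular Q-iteration on a made-up two-state, two-action MDP; the transition probabilities and rewards are arbitrary, chosen only to show the repeated Bellman backup (rather than a gradient step on a loss).

```python
import numpy as np

num_states, num_actions, gamma = 2, 2, 0.9
# P[s, a, s'] = transition probability, R[s, a] = expected reward (made-up numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((num_states, num_actions))
for _ in range(1000):
    # Bellman backup: Q <- r + gamma * E[max_a' Q(s', a')]; not a gradient step.
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-8:   # reached the fixed point Q* = B Q*
        break
    Q = Q_new
print(Q)
```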
- Policy gradient - The only one that actually performs gradient descent (ascent) on the true objective, but also often the least sample-efficient (see the sketch after this list). 
 
- Value function fitting - At best, minimizes the error of fit (the "Bellman error", which is not the same as the expected reward). 
- At worst, doesn't optimize anything and is not guaranteed to converge to anything in the nonlinear case. 
 
- Model-based RL - The model minimizes its error of fit, which will converge. 
- But there is no guarantee that a better model means a better policy. 
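The sketch below contrasts what each family actually optimizes. It uses PyTorch with made-up dummy tensors in place of real rollout or replay data, so it is illustrative only, not a working agent.

```python
import torch

# --- Policy gradient: a surrogate loss whose gradient equals the gradient of the
# true objective E[sum of rewards], so this really is gradient ascent.
log_probs = torch.randn(32, requires_grad=True)  # log pi(a_t | s_t) of sampled actions
returns = torch.randn(32)                        # Monte Carlo returns (dummy values)
pg_loss = -(log_probs * returns).mean()
pg_loss.backward()

# --- Value function fitting: minimizes the Bellman error, a proxy objective that is
# not the expected reward; with nonlinear function approximation nothing guarantees
# that driving it down yields a better policy, or that it converges at all.
gamma = 0.99
q_pred = torch.randn(32, 4, requires_grad=True)  # Q(s_t, .) for a batch
q_next = torch.randn(32, 4)                      # Q(s_{t+1}, .) (target, no grad)
reward = torch.randn(32)
td_target = reward + gamma * q_next.max(dim=-1).values
bellman_error = ((q_pred[:, 0] - td_target) ** 2).mean()  # pretend action 0 was taken
bellman_error.backward()

# --- Model-based RL: the model's regression loss converges like any supervised
# problem, but a smaller model error does not guarantee a better policy.
s_pred = torch.randn(32, 8, requires_grad=True)  # model's predicted next state
s_true = torch.randn(32, 8)                      # observed next state
model_loss = ((s_pred - s_true) ** 2).mean()
model_loss.backward()
```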
 
Different assumptions
Stochastic or deterministic? Continuous or discrete? Episodic or infinite horizon?
- Full observability - Generally assumed by value function fitting methods. 
- Can be mitigated by adding recurrence (see the recurrent-policy sketch after this list). 
 
- Episodic learning - Often assumed by pure policy gradient methods 
- Assumed by some model-based RL methods 
 
- Continuity or smoothness - Assumed by some continuous value function learning methods 
- Often assumed by some model-based RL methods 
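As an illustration of the recurrence mitigation mentioned above, here is a minimal recurrent-policy sketch in PyTorch; the layer sizes are arbitrary and it is not tied to any particular algorithm.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Acts on the observation *history* via an RNN hidden state, which stands in
    for the parts of the state that are not directly observed."""
    def __init__(self, obs_dim=8, hidden_dim=64, num_actions=4):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries memory across steps.
        out, hidden = self.rnn(obs_seq, hidden)
        return self.head(out), hidden

policy = RecurrentPolicy()
logits, h = policy(torch.randn(1, 10, 8))  # action logits for a 10-step observation history
```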
 
Different things are easy or hard in different settings
- Easier to represent the policy 
- Easier to represent the model 