Comparison
But why are there so many RL algorithms?
How many samples do we need to get a good policy?
Off-policy or on-policy?
Off-policy: able to improve the policy without generating new samples from that policy.
On-policy: each time the policy changes, even a little bit, we need to generate new samples from it (see the sketch below).
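A minimal sketch of the difference in where the training data comes from, assuming a hypothetical Gym-style `env` (with `reset`/`step`) and a placeholder `update_policy` function; none of these names come from a specific library.

```python
import random

def collect_rollout(env, policy):
    """Run one episode with the current policy and return its transitions."""
    traj, obs, done = [], env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        traj.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return traj

def train_on_policy(env, policy, update_policy, iters=100):
    # On-policy: every update needs fresh samples from the *current* policy;
    # data from older versions of the policy is discarded.
    for _ in range(iters):
        batch = collect_rollout(env, policy)
        update_policy(policy, batch)   # after this update, `batch` is stale

def train_off_policy(env, policy, update_policy, iters=100, batch_size=64):
    # Off-policy: transitions accumulate in a replay buffer and are reused,
    # even though they were generated by older versions of the policy.
    buffer = []
    for _ in range(iters):
        buffer.extend(collect_rollout(env, policy))
        minibatch = random.sample(buffer, min(batch_size, len(buffer)))
        update_policy(policy, minibatch)
```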
But why would we use a less efficient algorithm?
Because sample efficiency is not the only measure of an RL algorithm; a less sample-efficient algorithm can still be faster to run, since wall-clock time is not the same as sample efficiency.
Converge? Converge to what? Converge every time?
Supervised learning is almost always gradient descent, but RL often is not. Q-learning, for example, is fixed-point iteration. The families below differ in what, if anything, they actually optimize.
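To make the fixed-point-iteration point concrete, here is tabular Q-iteration on a tiny made-up MDP (the transition and reward numbers are arbitrary examples): we repeatedly apply the Bellman optimality backup until the Q-values stop changing, with no loss function and no gradient anywhere.

```python
import numpy as np

# A tiny, invented MDP (2 states, 2 actions): P[s, a, s'] are transition
# probabilities, R[s, a] are rewards. Purely illustrative numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Q-iteration: repeatedly apply the Bellman optimality operator
#   (T Q)(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')
# This is fixed-point iteration (a contraction in the tabular case),
# not gradient descent on any objective.
Q = np.zeros_like(R)
for _ in range(1000):
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.abs(Q_new - Q).max() < 1e-8:
        break
    Q = Q_new
print(Q)
```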
Policy gradient
The only family that actually performs gradient descent (ascent) on the true objective, but also often the least sample-efficient.
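A sketch of what "gradient ascent on the true objective" looks like, using a REINFORCE-style estimator for a tabular softmax policy; `theta` and the trajectory format are hypothetical placeholders chosen for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy_gradient_estimate(theta, trajectories):
    """REINFORCE estimate:
    grad J(theta) ~= mean over trajectories of
        sum_t grad log pi_theta(a_t | s_t) * (trajectory return).
    theta[s, a] are the logits of a softmax policy;
    each trajectory is a list of (state, action, reward) tuples."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)     # total reward of this trajectory
        for s, a, _ in traj:
            probs = softmax(theta[s])
            dlogp = -probs                   # grad of log-softmax w.r.t. logits
            dlogp[a] += 1.0
            grad[s] += dlogp * ret
    return grad / len(trajectories)

# Gradient *ascent* on the true objective E[return]:
#   theta += learning_rate * policy_gradient_estimate(theta, trajectories)
```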
Value function fitting
At best, minimizes the error of fit (the "Bellman error"), which is not the same as the expected reward.
At worst, doesn't optimize anything, and isn't guaranteed to converge to anything in the nonlinear (function-approximation) case.
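The following fitted-Q-style sketch shows which quantity is actually being minimized: a regression loss against Bellman backup targets, not the expected reward. `q_model` and its `fit`/`predict` methods are hypothetical placeholders for any function approximator.

```python
import numpy as np

def fitted_q_step(q_model, transitions, gamma=0.99):
    """One round of fitted Q-iteration on a batch of
    (state, action, reward, next_state, done) transitions."""
    states, actions, rewards, next_states, dones = map(np.array, zip(*transitions))
    # Bellman backup targets computed from the current model (held fixed).
    next_q = q_model.predict(next_states)                 # shape: (N, num_actions)
    targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
    # Supervised regression of Q(s, a) toward the targets: the thing being
    # minimized is this error of fit (the Bellman error), not the return.
    q_model.fit(states, actions, targets)
    return q_model
```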
Model-based RL
The model is fit by minimizing prediction error, which is ordinary supervised learning and will converge.
But there is no guarantee that a better model means a better policy.
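A rough sketch of the two-step structure, under assumed placeholder names (`model`, `reward_fn`, the candidate action sequences): the model-fitting step is plain regression, and the planning step trusts the model's predictions, so low model error does not by itself guarantee a good resulting policy.

```python
import numpy as np

def fit_dynamics_model(model, transitions):
    """Supervised regression of (state, action) -> next_state."""
    states, actions, _, next_states, _ = map(np.array, zip(*transitions))
    inputs = np.concatenate([states, actions], axis=-1)
    model.fit(inputs, next_states)      # minimizes error of fit, nothing more
    return model

def act(model, state, candidate_action_sequences, reward_fn, horizon=10):
    """Random-shooting planner: score each candidate sequence by rolling it
    out through the *learned* model, then execute the first action."""
    best_return, best_seq = -np.inf, None
    for seq in candidate_action_sequences:
        s, total = state, 0.0
        for a in seq[:horizon]:
            s = model.predict(np.concatenate([s, a])[None])[0]
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_seq = total, seq
    return best_seq[0]                  # replan at the next step
```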
Stochastic or deterministic / continuous or discrete / episodic or infinite horizon
Full observability
Generally assumed by value function fitting methods
Can be mitigated by adding recurrence
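Picking up the recurrence point above: one common mitigation is to condition the policy on a hidden state that summarizes the observation history rather than on the current observation alone. A minimal PyTorch sketch, with arbitrary example sizes:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Policy whose GRU hidden state carries memory of past observations,
    partially compensating for missing state information."""
    def __init__(self, obs_dim=8, hidden_dim=64, num_actions=4):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries memory across calls.
        out, hidden = self.gru(obs_seq, hidden)
        logits = self.head(out)          # action logits at every time step
        return logits, hidden

policy = RecurrentPolicy()
logits, h = policy(torch.zeros(1, 5, 8))   # a 5-step observation history
```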
Episodic learning
Often assumed by pure policy gradient methods
Assumed by some model-based RL methods
Continuity or smoothness
Assumed by some continuous value function learning methods
Often assumed by some model-based RL methods
What is easier to represent?
Sometimes the policy is easier to represent.
Sometimes the model is easier to represent.