Comparison

But why so many RL algorithms?

Different tradeoffs

Sample efficiency

- How many samples do we need to get a good policy?
- Off-policy or on-policy? (see the sketch below)
- Off-policy: able to improve the policy without generating new samples from **that policy**.
- On-policy: each time the policy changes, even a little bit, we need to generate new samples.

[Figure: off-policy vs. on-policy algorithms on the sample-efficiency spectrum]
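To make the distinction concrete, here is a minimal sketch of how the two families consume samples. The environment, `collect_transition`, and `update` are hypothetical stubs, not from any specific library:

```python
import random
from collections import deque

# Toy stand-ins so the sketch runs; a real setup would plug in an actual
# environment and learner (both hypothetical here).
def collect_transition(policy, env):
    return ("s", "a", 0.0, "s'")        # (state, action, reward, next_state)

def update(policy, batch):
    pass                                # e.g. a Q-learning or policy-gradient step

policy, env = object(), object()
replay_buffer = deque(maxlen=100_000)

# Off-policy: one fresh transition per step, but the update can reuse
# transitions gathered by older versions of the policy.
for _ in range(10):
    replay_buffer.append(collect_transition(policy, env))
    update(policy, random.sample(replay_buffer, k=min(4, len(replay_buffer))))

# On-policy: every update needs a fresh batch from the *current* policy,
# and that batch is stale (discarded) as soon as the policy changes.
for _ in range(10):
    fresh_batch = [collect_transition(policy, env) for _ in range(4)]
    update(policy, fresh_batch)
```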

But why would we use a less efficient algorithm?

Because sample efficiency is not the only measure of an RL algorithm; a less sample-efficient algorithm can still be faster to train when samples are cheap, e.g. in simulation -- wall-clock time is not the same as sample efficiency.

Stability & ease of use

Converge? Converge to what? Converge every time?

Supervised learning is almost always gradient descent, but RL often is not. For example, Q-learning is fixed-point iteration (see the sketch below).
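To see the contrast, here is tabular Q-iteration on a made-up 2-state, 2-action MDP: each sweep applies the Bellman backup, iterating an operator toward its fixed point; no loss is differentiated and no gradient step is taken.

```python
import numpy as np

# Each sweep applies the Bellman backup
#   Q(s, a) <- r(s, a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s', a')
gamma = 0.9
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])                    # R[s, a]
P = np.zeros((2, 2, 2))                       # P[s, a, s']
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]

Q = np.zeros((2, 2))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)         # one Bellman backup sweep

print(Q)  # converges: in the tabular case the backup is a gamma-contraction
```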

- Policy gradient
- The only one that actually performs gradient descent (ascent) on the true objective, but often also the least sample-efficient (see the sketch below).
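A minimal sketch of gradient ascent on the true objective, using REINFORCE with a softmax policy on a toy 2-armed bandit (the problem and hyperparameters are made up for illustration):

```python
import numpy as np

# Ascend J(theta) = E_pi[r] with the score-function estimator
#   grad J = E[ grad log pi(a) * r ].
rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])   # arm 1 is better
theta = np.zeros(2)                 # softmax-policy logits
lr = 0.1

for _ in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 1.0)   # sampled reward
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                # d/dtheta of log softmax(theta)[a]
    theta += lr * grad_log_pi * r        # unbiased gradient ascent step

print(probs)  # probability mass concentrates on the better arm
```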

- Value function fitting
- At best, minimizes error of fit (the "Bellman error", which is not the same as expected reward); see the sketch below.
- At worst, doesn't optimize anything, and isn't guaranteed to converge to anything in the nonlinear case.
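A sketch of what minimizing the error of fit looks like, assuming a linear Q-function with one-hot features (made up here) and the usual semi-gradient update in which the bootstrapped target is held fixed:

```python
import numpy as np

# We regress Q onto bootstrapped targets, so the quantity being driven down
# is the Bellman error of the fit, not the expected reward. Because the
# target is treated as a constant ("semi-gradient"), this is not gradient
# descent on any fixed objective.
gamma, lr = 0.9, 0.01
w = np.zeros(4)                              # linear Q: Q(s, a) = phi(s, a) @ w

def phi(s, a):                               # made-up one-hot features
    v = np.zeros(4)
    v[2 * s + a] = 1.0
    return v

def td_step(s, a, r, s_next):
    global w
    target = r + gamma * max(phi(s_next, b) @ w for b in (0, 1))
    w += lr * (target - phi(s, a) @ w) * phi(s, a)   # target held fixed

td_step(0, 1, 1.0, 1)                        # one update on a toy transition
```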

- Model-based RL
- The model minimizes its error of fit, which will converge (see the sketch below).
- But there is no guarantee that a better model **is** a better policy.
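A sketch of the first point, assuming toy linear dynamics and made-up data: fitting the model is ordinary least-squares regression, so the fit converges; but nothing in this loss mentions the return of the policy that will be planned against the model.

```python
import numpy as np

# Supervised regression of next state on (state, action): converges like any
# least-squares problem, yet low prediction error says nothing directly
# about the return of the policy planned against the model.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 3))                # states (toy data)
A = rng.normal(size=(500, 1))                # actions (toy data)
X = np.hstack([S, A])
S_next = S + 0.1 * A + 0.01 * rng.normal(size=S.shape)  # made-up dynamics

# Linear dynamics model s' ~ X @ W, fit in closed form.
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
print("fit error:", np.mean((X @ W - S_next) ** 2))
```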

Different assumptions

Stochastic or deterministic / continuous or discrete / episodic or infinite horizon

- 1. Full observability
- Generally assumed by value function fitting methods
- Can be mitigated by adding recurrence

- 2. Episodic learning
- Often assumed by pure policy gradient methods
- Assumed by some model-based RL methods

- 3. Continuity or smoothness
- Assumed by some continuous value function learning methods
- Often assumed by some model-based RL methods

Different things are easy or hard in different settings

- Easier to represent the policy
- Easier to represent the model
