# Comparison

But why are there so many RL algorithms?

## Different tradeoffs

### Sample efficiency

![Sample efficiency](https://4133958719-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LigLKy0c06y4iTEtrkI%2F-LjV3u5GCLvDVmHuN3ws%2F-LjV3ur8Fk-wiMRh-39V%2F1562830551858.png?generation=1562832516019115\&alt=media)

* How many samples do we need to get a good policy?
* Off-policy or on-policy?
  * Off-policy: able to improve the policy without generating new samples from **that policy**.
  * On-policy: each time the policy changes, even a little bit, we need to generate new samples (see the sketch below).

![off/on -policy](https://4133958719-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LigLKy0c06y4iTEtrkI%2F-LjV3u5GCLvDVmHuN3ws%2F-LjV3urBdDMbyKHqd35v%2F1562830786663.png?generation=1562832515888584\&alt=media)
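
As a rough illustration, here is a minimal sketch of the two data regimes; `policy`, `update`, `collect_rollouts`, and `replay_buffer` are hypothetical stand-ins for whatever a concrete algorithm provides.

```python
import random

def train_off_policy(policy, replay_buffer, update, num_updates=1000, batch_size=64):
    """Off-policy: old transitions stay useful, even if they came from earlier policies."""
    for _ in range(num_updates):
        batch = random.sample(replay_buffer, batch_size)  # reuse previously collected data
        update(policy, batch)

def train_on_policy(policy, collect_rollouts, update, num_iterations=1000):
    """On-policy: every policy update invalidates the old data."""
    for _ in range(num_iterations):
        batch = collect_rollouts(policy)  # fresh samples from the *current* policy
        update(policy, batch)
        # after this update, `batch` no longer matches the policy and is discarded
```

The contrast is in the data source: the off-policy loop keeps reading from a buffer of old experience, while the on-policy loop has to interact with the environment again after every update.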

But why would we use a less sample-efficient algorithm?

Because sample efficiency is not the only measure of an RL algorithm. A less sample-efficient algorithm may still be quicker to train in practice: wall-clock time is not the same as sample efficiency.

### Stability & ease of use

Converge? Converge to what? Converge every time?

Supervised learning is almost always gradient descent, but RL often is not. Q-learning, for example, is a fixed-point iteration rather than gradient descent on a fixed objective (see the sketch after the list below).

* Policy gradient
  * The only one that actually performs gradient descent (ascent) on the true objective, but also often the least efficient.
* Value function fitting
  * At best, minimizes error of fit ("Bellman error", not the same as expected reward)
  * At worst, doesn't optimize anything, not guaranteed to converge to anything in the nonlinear case.
* Model-based RL
  * Model minimizes error of fit, which will converge.
  * But there is no guarantee that a better model **is** a better policy.
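
To make the contrast concrete, here is a rough tabular sketch (an illustration, not taken from the course): the REINFORCE estimator really is a sample of the gradient of expected return, while the Q-learning update chases a bootstrapped target that moves with `Q` itself.

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular softmax policy: pi(a|s) proportional to exp(theta[s, a])."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_gradient(theta, trajectory, returns):
    """Policy gradient (REINFORCE): sum_t grad log pi(a_t|s_t) * R_t, where R_t is the
    reward-to-go. This is an unbiased sample of the gradient of expected return, so
    following it really is gradient ascent on the true objective."""
    grad = np.zeros_like(theta)
    for (s, a), R in zip(trajectory, returns):
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs            # d/d theta[s, :] of log pi(a|s) = onehot(a) - probs
        grad_log_pi[a] += 1.0
        grad[s] += grad_log_pi * R
    return grad

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning: a fixed-point iteration toward r + gamma * max_a' Q(s_next, a').
    The target depends on Q itself, so this is not gradient descent on a fixed objective."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

With a table, the fixed-point iteration converges; once `Q` is a nonlinear function approximator, the same update comes with no such guarantee, which is the instability mentioned above.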

## Different assumptions

Stochastic or deterministic? Continuous or discrete? Episodic or infinite horizon? Different algorithms make different assumptions:

1. Full observability
   * Generally assumed by value function fitting methods
   * Can be mitigated by adding recurrence (see the sketch after this list)
2. Episodic learning
   * Often assumed by pure policy gradient methods
   * Assumed by some model-based RL methods
3. Continuity or smoothness
   * Assumed by some continuous value function learning methods
   * Often assumed by some model-based RL methods
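
For the recurrence point above, here is a minimal sketch (assuming PyTorch; the architecture is illustrative, not prescribed): a GRU summarizes the observation history, so the Q-values no longer assume the current observation is the full state.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-network that conditions on a history of observations via a GRU."""

    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); `hidden` carries memory of past observations
        features, hidden = self.gru(obs_seq, hidden)
        return self.q_head(features), hidden   # Q-values: (batch, time, num_actions)

# usage: feed a history of observations rather than a single state
net = RecurrentQNetwork(obs_dim=8, num_actions=4)
q_values, hidden = net(torch.randn(1, 10, 8))   # q_values has shape (1, 10, 4)
```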

## Different things are easy or hard in different settings

* Sometimes it is easier to represent the policy than the model.
* Sometimes it is easier to represent the model than the policy.

