# Correlated samples and unstable target

The online Q-iteration algorithm has two apparent problems. First, the samples, which are collected sequentially by some policy, are strongly correlated (why this is a problem is discussed below). Second, the target value changes at every step.

![Sine function](https://4133958719-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LigLKy0c06y4iTEtrkI%2F-LoLI1uZI6IPZlnK6XB_%2F-LoLI3ySzR9nCq7g2Ads%2F1568034830764.png?generation=1568037164012496\&alt=media)

Take the sine function as an example: if we observe the curve point by point through time (the x-axis), we will never regress to the full curve, since consecutive points are strongly correlated and the regression target keeps changing.
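To make this concrete, here is a small sketch (the numbers are illustrative, not from the text) that fits a single constant to $$\sin(x)$$ by SGD. Fed the points in time order, the estimate chases the local value of the curve and never settles; fed the same points shuffled (approximately i.i.d.), it converges near the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0, 20 * np.pi, 5000)   # 10 full periods of the sine
ys = np.sin(xs)

def fit_constant(targets, lr=0.01):
    """SGD on (c - y)^2; the best constant fit to sin(x) is its mean, 0."""
    c, history = 0.0, []
    for y in targets:
        c -= lr * (c - y)               # gradient step toward the current target
        history.append(c)
    return np.array(history)

seq = fit_constant(ys)                  # samples observed in time order (correlated)
iid = fit_constant(rng.permutation(ys)) # the same samples, shuffled (~i.i.d.)

# With correlated data the estimate keeps oscillating along with the curve;
# with shuffled data it hovers near the true mean, 0.
print(seq[-500:].std(), iid[-500:].std())
```

The same data, merely reordered, changes whether the regression converges at all, which is exactly the problem the two solutions below address.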

## Correlated samples

### Solution 1.

![parallel sampling](https://4133958719-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LigLKy0c06y4iTEtrkI%2F-LoLI1uZI6IPZlnK6XB_%2F-LoLI3yU96e85fMJv3EL%2F1568035054243.png?generation=1568037163869623\&alt=media)

We need samples that are (approximately) independent and identically distributed. One option is to collect samples with multiple workers in parallel (synchronously or asynchronously), so each update sees transitions from several independent trajectories.

### Solution 2.

Since Q-learning is off-policy, samples collected by any policy can be used. We can therefore store transitions in a buffer and sample batches from it; with a large enough buffer, the batches are approximately i.i.d. This structure is called a replay buffer.
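A replay buffer can be as simple as a fixed-capacity FIFO queue; a minimal sketch (sizes are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """A minimal FIFO replay buffer; old transitions are evicted first."""

    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.storage.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between transitions.
        return random.sample(self.storage, batch_size)

# Usage with dummy integer transitions:
buf = ReplayBuffer(capacity=100)
for t in range(100):
    buf.add(t, 0, t + 1, 1.0)
batch = buf.sample(32)
```

Because the buffer mixes transitions from many points in time (and potentially many policies), a sampled batch is far less correlated than the most recent trajectory.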

![Replay Buffer](https://4133958719-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LigLKy0c06y4iTEtrkI%2F-LoLI1uZI6IPZlnK6XB_%2F-LoLI3yWex8VyHzTlKx2%2F1568035352124.png?generation=1568037163978667\&alt=media)

> full Q-learning with replay buffer
>
> repeat until convergence:
>
> \==== 1: collect a dataset $$\{(s\_i,a\_i,s'\_i,r\_i)\}$$ using some policy, add it to $$\mathcal{B}$$
>
> \==== repeat $$K$$ times ($$K=1$$ is common, though a larger $$K$$ is more efficient):
>
> \======== 2: sample a batch $$(s\_i,a\_i,s'\_i,r\_i)$$ from $$\mathcal{B}$$
>
> \======== 3: $$\phi\leftarrow \phi -\alpha \sum\_i \frac{dQ\_\phi(s\_i,a\_i)}{d\phi}\left(Q\_\phi(s\_i,a\_i)-\left[r(s\_i,a\_i)+\gamma \max\_{a'}Q\_\phi(s'\_i,a')\right] \right)$$

**+:** samples are no longer correlated

**+:** multiple samples in the batch (low-variance gradient)
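Step 3 becomes especially simple for a tabular $$Q$$, where the gradient of $$Q\_\phi(s\_i,a\_i)$$ with respect to each entry is an indicator. A toy sketch (the batch transitions are made up; updates are applied per sample for simplicity, whereas the pseudocode sums over the batch):

```python
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

# A toy batch sampled from the buffer: (s, a, s', r) tuples.
batch = [(0, 1, 1, 1.0), (1, 0, 2, 0.0), (2, 1, 2, 1.0)]

for s, a, s_next, r in batch:
    target = r + gamma * Q[s_next].max()   # r + γ max_a' Q(s', a')
    # For a tabular Q, dQ(s,a)/dφ is 1 at entry (s,a) and 0 elsewhere,
    # so the update in step 3 touches only Q[s, a].
    Q[s, a] -= alpha * (Q[s, a] - target)
```

Note that the target itself still depends on the current $$Q$$, which is the second problem addressed below.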

## Unstable target

In the online Q-learning algorithm, the target changes at every step, which causes instability compared to a regression problem in supervised learning, where the targets are fixed.

> Q-learning with replay buffer and target network:
>
> repeat until convergence:
>
> \==== 1: save target network parameters: $$\phi'\leftarrow \phi$$
>
> \==== repeat $$N$$ times:
>
> \======== 2: collect a dataset $$\{(s\_i,a\_i,s'\_i,r\_i)\}$$ using some policy, add it to $$\mathcal{B}$$
>
> \======== repeat $$K$$ times:
>
> \============ 3: sample a batch $$(s\_i,a\_i,s'\_i,r\_i)$$ from $$\mathcal{B}$$
>
> \============ 4: $$\phi\leftarrow \phi -\alpha \sum\_i \frac{dQ\_\phi(s\_i,a\_i)}{d\phi}\left(Q\_\phi(s\_i,a\_i)-\left[r(s\_i,a\_i)+\gamma \max\_{a'}Q\_{\phi'}(s'\_i,a')\right] \right)$$

Notice that at line 4 the target value is now computed with $$\phi'$$ instead of $$\phi$$.

The purpose is to keep the target fixed throughout the inner loop, which makes each inner update look like a standard regression problem and is therefore more stable. The price is that the targets are computed from an older $$\phi'$$, so they lag behind the current network and learning takes longer overall.
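The loop structure can be sketched in the same tabular setting as before (toy, made-up transitions; $$N$$ and $$K$$ are illustrative). The key point is that `Q_target` is frozen in the inner loops, so the regression target does not move while `Q` is being fit:

```python
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

# Toy batch standing in for samples drawn from the replay buffer.
batch = [(0, 1, 1, 1.0), (1, 0, 2, 0.0), (0, 1, 1, 1.0)]

for outer in range(3):
    Q_target = Q.copy()                      # step 1: φ' ← φ
    for _ in range(4):                       # inner loops: φ' stays fixed
        for s, a, s_next, r in batch:
            # The target uses φ' (Q_target), so it does not move
            # while we take gradient steps on Q.
            y = r + gamma * Q_target[s_next].max()
            Q[s, a] -= alpha * (Q[s, a] - y)
```

Within each outer iteration the inner updates solve an ordinary regression toward fixed targets; only when $$\phi'$$ is refreshed do the targets move.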

