Correlated samples and unstable target

The online Q-iteration algorithm has two apparent problems. The first is that the samples, which are observed sequentially by some policy, are strongly correlated (why this is a problem is discussed below). The second is that the target value keeps changing as the parameters are updated.

Take the sine function as an example: if we observe the curve one point at a time along the x-axis, we will never regress to the whole function, since consecutive points are strongly correlated and the local target keeps changing.
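To make this concrete, here is a minimal numpy sketch (ours, not from the original notes) that fits a degree-3 polynomial to the sine curve with single-sample SGD. Sweeping through the points in x-order chases whichever region was seen last, while the same points in shuffled (roughly i.i.d.) order recover a fit over the whole curve:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 1000)
ys = np.sin(np.pi * xs)

def features(x):
    # Degree-3 polynomial features for a simple linear regressor.
    return np.array([1.0, x, x**2, x**3])

def sgd_pass(order, lr=0.1):
    # One pass of single-sample SGD over the points in the given order.
    w = np.zeros(4)
    for i in order:
        err = features(xs[i]) @ w - ys[i]
        w -= lr * err * features(xs[i])
    return w

def global_mse(w):
    X = np.stack([features(x) for x in xs])
    return np.mean((X @ w - ys) ** 2)

# Correlated stream: points arrive in x-order, so each update only sees
# the local piece of the curve and the fit keeps chasing it.
w_seq = sgd_pass(np.arange(len(xs)))

# Roughly i.i.d. stream: the same points in random order cover the whole curve.
w_iid = sgd_pass(rng.permutation(len(xs)))

print(f"x-ordered pass MSE: {global_mse(w_seq):.4f}")  # typically much worse
print(f"shuffled pass  MSE: {global_mse(w_iid):.4f}")
```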

Correlated samples

Solution 1.

Collect the samples in an independently and identically distributed (i.i.d.) way.

Solution 2.

Since Q-learning is off-policy, any policy can be used to collect samples, so we can store transitions in a buffer and sample batches from it later; such batches are close to i.i.d. This method is called a replay buffer.
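A minimal replay buffer can be a fixed-size FIFO container with uniform random sampling. The sketch below is our own illustration (the class and method names are not from the notes):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, s', r) transitions."""

    def __init__(self, capacity):
        # When full, the oldest transitions are evicted first.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform sampling mixes transitions collected at very different
        # times, which breaks the temporal correlation of the stream.
        return random.sample(self.buffer, batch_size)
```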

Full Q-learning with replay buffer:

repeat until convergence:

==== 1: collect dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy, add it to $\mathcal{B}$

==== repeat $K$ times: ($K = 1$ is common, though larger $K$ is more efficient)

======== 2: sample a batch $(s_i, a_i, s'_i, r_i)$ from $\mathcal{B}$

======== 3: $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi(s_i,a_i)}{d\phi}\left(Q_\phi(s_i,a_i) - \left[r(s_i,a_i) + \gamma \max_{a'} Q_\phi(s'_i,a')\right]\right)$

+: samples are no longer correlated

+: multiple samples in the batch (low-variance gradient)
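In code, step 3 might look like the following PyTorch sketch, assuming (our assumption, not something specified in the notes) a q_net module that maps a batch of states to Q-values for every discrete action:

```python
import torch

def q_learning_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient step of step 3 on a sampled batch (illustrative sketch)."""
    s, a, s_next, r = batch  # tensors: states, integer actions, next states, rewards
    # Q_phi(s_i, a_i): pick the Q-value of the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target r + gamma * max_a' Q_phi(s'_i, a'). no_grad matches the
        # update rule: the derivative is taken only through Q_phi(s_i, a_i).
        # Note the target is still computed with the current phi, which is
        # the instability addressed in the next section.
        target = r + gamma * q_net(s_next).max(dim=1).values
    loss = ((q_sa - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```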

Unstable target

In the online Q-learning algorithm, the target changes at every step, which can cause instability; this is unlike a regression problem in supervised learning, where the targets are fixed.

Q-learning with replay buffer and target network:

repeat until convergence:

==== 1: save target network parameters: $\phi' \leftarrow \phi$

==== repeat $N$ times:

======== 2: collect dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy, add it to $\mathcal{B}$

======== repeat $K$ times:

============ 3: sample a batch $(s_i, a_i, s'_i, r_i)$ from $\mathcal{B}$

============ 4: $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi(s_i,a_i)}{d\phi}\left(Q_\phi(s_i,a_i) - \left[r(s_i,a_i) + \gamma \max_{a'} Q_{\phi'}(s'_i,a')\right]\right)$

Notice that at step 4 the target value now uses $\phi'$ instead of $\phi$.

The purpose is to keep the target fixed throughout the inner loop, which makes each inner loop resemble a standard regression procedure and is therefore more stable. The price is that the targets are computed from older parameters, so learning takes longer than before.
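Putting the pieces together, the loop above might be sketched as follows; collect_transitions and a buffer whose sample method returns batched tensors are hypothetical placeholders, not APIs from the notes:

```python
import copy
import torch

def q_learning_with_target_network(q_net, buffer, optimizer,
                                   num_iters, N, K, batch_size, gamma=0.99):
    """Illustrative sketch of the replay-buffer + target-network loop."""
    target_net = copy.deepcopy(q_net)
    for _ in range(num_iters):
        target_net.load_state_dict(q_net.state_dict())  # step 1: phi' <- phi
        for _ in range(N):
            collect_transitions(buffer)  # step 2 (hypothetical env-interaction helper)
            for _ in range(K):
                s, a, s_next, r = buffer.sample(batch_size)  # step 3
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    # Step 4: the target uses the frozen phi', so it stays
                    # fixed for the entire inner loop.
                    target = r + gamma * target_net(s_next).max(dim=1).values
                loss = ((q_sa - target) ** 2).mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```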
