Correlated samples and unstable target
Looking at the online Q-iteration algorithm, we can identify two apparent problems. First, the samples, which are observed sequentially by some policy, are correlated; why this is a problem is discussed below. Second, the target value keeps changing at every step.
Take a sine function as an example: if we observe points along the curve one at a time (moving along the x-axis), we will never regress to the full function, since consecutive points are strongly correlated and the target value is always changing.
We need samples that are independently and identically distributed (i.i.d.).
Since Q-learning is off-policy, samples gathered by any policy can be used. We can therefore store transitions in a buffer and sample batches from it later, which makes the data close enough to i.i.d. This structure is called a replay buffer.
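As a rough illustration (not from the original notes), a replay buffer can be a fixed-size container of transitions that we sample from uniformly at random; the class and method names below are assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s') transitions, sampled uniformly at random."""

    def __init__(self, capacity=100_000):
        # oldest transitions are dropped automatically once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation
        # between consecutive transitions
        indices = random.sample(range(len(self.storage)), batch_size)
        return [self.storage[i] for i in indices]
```

Because the update is off-policy, any behavior policy (for example, epsilon-greedy around the current Q) can be used to keep filling the buffer.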
full Q-learning with replay buffer:
repeat until converge:
    1: collect dataset $\{(s_i, a_i, s_i', r_i)\}$ using some policy, add it to the buffer $\mathcal{B}$
    repeat K times: (K=1 is common, though larger K is more efficient)
        2: sample a batch $(s_i, a_i, s_i', r_i)$ from $\mathcal{B}$
        3: $\phi \leftarrow \phi - \alpha \sum_i \frac{\mathrm{d}Q_\phi(s_i, a_i)}{\mathrm{d}\phi}\left(Q_\phi(s_i, a_i) - \left[r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')\right]\right)$
+: samples are no longer correlated
+: multiple samples in the batch (low-variance gradient)
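A minimal sketch of the gradient step in step 3, assuming a PyTorch `q_net` that maps a batch of states to Q-values over discrete actions and a batch already collated into tensors (all names here are illustrative, not from the notes):

```python
import torch
import torch.nn.functional as F

def q_learning_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on a sampled batch (step 3 above)."""
    states, actions, rewards, next_states = batch

    # target y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), held fixed for this step
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values

    # current estimates Q_phi(s_i, a_i) for the actions that were taken
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # minimizing the squared error between Q_phi(s, a) and the (fixed) target
    # yields the gradient step written in step 3
    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```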
In the online Q-learning algorithm the target changes at every step, which causes instability compared to a regression problem in supervised learning, where the targets stay fixed.
Q-learning with replay buffer and target network:
repeat until converge:
    1: save target network parameters: $\phi' \leftarrow \phi$
    repeat N times:
        2: collect dataset $\{(s_i, a_i, s_i', r_i)\}$ using some policy, add it to $\mathcal{B}$
        repeat K times:
            3: sample a batch $(s_i, a_i, s_i', r_i)$ from $\mathcal{B}$
            4: $\phi \leftarrow \phi - \alpha \sum_i \frac{\mathrm{d}Q_\phi(s_i, a_i)}{\mathrm{d}\phi}\left(Q_\phi(s_i, a_i) - \left[r(s_i, a_i) + \gamma \max_{a'} Q_{\phi'}(s_i', a')\right]\right)$
The purpose is to keep the target unchanged inside the inner loop, which makes each inner loop look like a standard regression problem and is therefore more stable. The price is that everything takes longer to finish than before.
Notice that in step 4 the target value is now computed with the target network parameters $\phi'$ instead of $\phi$.
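As a rough sketch of these nested loops, reusing the buffer and the update from the earlier snippets, with a caller-supplied data-collection routine (all names and hyperparameter values are illustrative):

```python
import copy
import torch
import torch.nn.functional as F

def update_with_target(q_net, target_net, optimizer, batch, gamma=0.99):
    """Step 4: same gradient step as before, but the max uses the frozen parameters phi'."""
    states, actions, rewards, next_states = batch
    with torch.no_grad():
        # the bootstrap target is computed with target_net (phi'), not the trained q_net (phi)
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def train(q_net, optimizer, buffer, collect_fn, iterations=1000, N=4, K=1, batch_size=64):
    """Nested loops from the pseudocode; collect_fn(q_net, buffer) adds new transitions."""
    for _ in range(iterations):
        target_net = copy.deepcopy(q_net)             # step 1: phi' <- phi
        for _ in range(N):
            collect_fn(q_net, buffer)                 # step 2: gather data with some policy
            for _ in range(K):
                # step 3: sample a batch (collation into tensors omitted for brevity)
                batch = buffer.sample(batch_size)
                update_with_target(q_net, target_net, optimizer, batch)  # step 4
```

Copying the whole network once per outer iteration is only one way to refresh $\phi'$; the key point is that the target stays frozen for the inner N $\times$ K updates.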