Q iteration
Fitted Q-iteration
What if we don't know the transition dynamics? Use Q-values and fit the Q-function $Q_\phi(s, a)$ instead of the value function $V$.
Fitted Q-iteration algorithm:
repeat until convergence:
====1: set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a_i'} Q_\phi(s_i', a_i')$
====2: set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \| Q_\phi(s_i, a_i) - y_i \|^2$
+: works even for off-policy samples (unlike actor-critic)
+: only one network, no high-variance policy gradient
-: no convergence guarantees for non-linear function approximation
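As a concrete illustration, here is a minimal sketch of the two inner steps, assuming a linear Q-function $Q_\phi(s, a) = \phi^\top \mathrm{feat}(s, a)$ so that the $\arg\min$ in step 2 reduces to an ordinary least-squares fit; `feat`, `actions`, and the batch arrays are hypothetical placeholders, not part of the original notes.

```python
import numpy as np

def fitted_q_step(phi, feat, actions, states, acts, rews, next_states, gamma=0.99):
    """One pass of steps 1-2 for a linear Q-function Q_phi(s, a) = phi @ feat(s, a)."""
    # Step 1: Bellman targets y_i = r_i + gamma * max_{a'} Q_phi(s_i', a')
    q_next = np.array([[phi @ feat(s, a) for a in actions] for s in next_states])
    y = rews + gamma * q_next.max(axis=1)

    # Step 2: phi <- argmin_phi 1/2 sum_i ||Q_phi(s_i, a_i) - y_i||^2
    # For a linear Q this argmin is exactly an ordinary least-squares problem.
    X = np.array([feat(s, a) for s, a in zip(states, acts)])
    phi_new, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi_new
```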
Full fitted Q-iteration algorithm:
repeat until convergence:
====1: collect dataset $\{(s_i, a_i, s_i', r_i)\}$ using some policy; (dataset size $N$, collection policy)
====repeat $K$ times; (iterations $K$)
========2: set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a_i'} Q_\phi(s_i', a_i')$
========3: set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \| Q_\phi(s_i, a_i) - y_i \|^2$ ; (gradient steps $S$)
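A sketch of the whole loop with the hyperparameters named above (dataset size $N$, iterations $K$, gradient steps $S$), under the same hypothetical linear-Q assumption; `collect_transitions` stands in for whatever collection policy is used.

```python
import numpy as np

def fitted_q_iteration(phi, feat, actions, collect_transitions,
                       num_outer=100, N=10000, K=10, S=100,
                       alpha=1e-3, gamma=0.99):
    """Full fitted Q-iteration with a linear Q_phi(s, a) = phi @ feat(s, a)."""
    for _ in range(num_outer):                                     # "repeat until convergence"
        states, acts, rews, next_states = collect_transitions(N)   # 1: collect N transitions
        for _ in range(K):                                          # repeat K times
            # 2: Bellman targets, recomputed with the current phi
            q_next = np.array([[phi @ feat(s, a) for a in actions] for s in next_states])
            y = rews + gamma * q_next.max(axis=1)
            X = np.array([feat(s, a) for s, a in zip(states, acts)])
            # 3: S gradient steps on 1/2 sum_i (Q_phi(s_i, a_i) - y_i)^2,
            #    whose gradient w.r.t. phi is X^T (X phi - y)
            for _ in range(S):
                phi = phi - alpha * X.T @ (X @ phi - y)
    return phi
```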
Why off-policy
Given $s$ and $a$, the transition $s'$ is independent of $\pi$, so the Bellman target does not depend on which policy collected the data.

Optimize what
In step 2 of the algorithm, the max improves the policy (exactly, in the tabular case). If we denote the dataset distribution as $\beta$, the error being minimized is
$$\mathcal{E} = \frac{1}{2} E_{(s,a)\sim\beta}\left[ \left( Q_\phi(s, a) - \left[ r(s, a) + \gamma \max_{a'} Q_\phi(s', a') \right] \right)^2 \right]$$
If $\mathcal{E} = 0$, then $Q_\phi(s, a) = r(s, a) + \gamma \max_{a'} Q_\phi(s', a')$, which is the optimal Q-function, corresponding to the optimal policy $\pi'$. But when we leave the tabular case and use neural-network function approximation, most guarantees are lost.
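For illustration, $\mathcal{E}$ can be estimated on the sampled dataset (a Monte Carlo approximation of the expectation under $\beta$); this sketch again assumes the hypothetical linear-Q setup from above.

```python
import numpy as np

def bellman_error(phi, feat, actions, states, acts, rews, next_states, gamma=0.99):
    """Estimate E = 1/2 E_beta[(Q_phi(s,a) - (r + gamma max_a' Q_phi(s',a')))^2] on samples."""
    q_sa = np.array([phi @ feat(s, a) for s, a in zip(states, acts)])
    q_next = np.array([[phi @ feat(s, a) for a in actions] for s in next_states])
    targets = rews + gamma * q_next.max(axis=1)
    return 0.5 * np.mean((q_sa - targets) ** 2)
```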
Online Q-iteration
Set $N = 1, K = 1$.
Online Q-iteration algorithm:
repeat until convergence:
====1: collect one transition $(s_i, a_i, s_i', r_i)$ using some policy; (collection policy)
====2: set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a_i'} Q_\phi(s_i', a_i')$
====3: set $\phi \leftarrow \phi - \alpha \frac{dQ_\phi(s_i, a_i)}{d\phi} \left( Q_\phi(s_i, a_i) - y_i \right)$ ; (gradient steps $S$)
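A sketch of one online update ($N = 1$, $K = 1$, a single gradient step), again under the hypothetical linear-Q assumption, so that $\frac{dQ_\phi(s_i, a_i)}{d\phi} = \mathrm{feat}(s_i, a_i)$; `phi` and `feat(s, a)` are assumed to be NumPy vectors.

```python
def online_q_update(phi, feat, actions, s, a, r, s_next, alpha=1e-3, gamma=0.99):
    """One online Q-iteration step on a single transition (s, a, s', r)."""
    # 2: target from the single observed transition
    y = r + gamma * max(phi @ feat(s_next, a_next) for a_next in actions)
    # 3: one gradient step; for the linear Q, dQ_phi(s, a)/dphi = feat(s, a)
    td_error = phi @ feat(s, a) - y
    return phi - alpha * feat(s, a) * td_error
```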
Exploration
Using only the final policy, which is deterministic, to collect samples in step 1 is a bad idea because of the exploration-exploitation problem. Since Q-iteration is an off-policy algorithm, we are free to collect data with other policies (sketched after this list), such as:
epsilon-greedy: $\pi(a_t \mid s_t) = \begin{cases} 1 - \epsilon & \text{if } a_t = \arg\max_{a_t} Q_\phi(s_t, a_t) \\ \epsilon / (|\mathcal{A}| - 1) & \text{otherwise} \end{cases}$
Boltzmann exploration: $\pi(a_t \mid s_t) \propto \exp(Q_\phi(s_t, a_t))$
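A sketch of both exploration rules as action samplers over the current state's Q-values; `q_values` is assumed to be the vector $[Q_\phi(s_t, a)]_a$, and the epsilon branch here samples uniformly over all actions, a common simplification of the $\epsilon / (|\mathcal{A}| - 1)$ rule above.

```python
import numpy as np

def epsilon_greedy(q_values, eps=0.1):
    """With prob. 1 - eps take argmax_a Q(s, a); otherwise a uniformly random action."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample a ~ pi(a|s) proportional to exp(Q(s, a) / temperature)."""
    logits = np.asarray(q_values) / temperature
    logits = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))
```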