Q-iteration
Fitted Q-iteration
What if we don't know the transition dynamics? Work with Q-values and fit a Q-function $Q_\phi(s, a)$ instead of $V(s)$: the $\max$ over actions in the target can then be evaluated directly from $Q_\phi$ at sampled next states, without knowing the dynamics.
Fitted Q-iteration algorithm:
repeat until converged:
====1: set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a'_i} Q_\phi(s'_i, a'_i)$
====2: set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\lVert Q_\phi(s_i, a_i) - y_i \right\rVert^2$
+: works even for off-policy samples (unlike actor-critic)
+: only one network, no high-variance policy gradient
-: no convergence guarantees for non-linear function approximation
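To make the two steps concrete, here is a minimal numpy sketch of one fitted Q-iteration update on a batch of transitions. The linear Q-function on one-hot features, the toy batch, and the names (`feat`, `fitted_q_update`, `n_states`, `n_actions`) are illustrative assumptions standing in for a neural network, not part of the notes.

```python
# Minimal sketch of one fitted Q-iteration update (steps 1-2) on a batch of
# transitions. A linear Q-function on one-hot features stands in for a
# neural network; everything here is illustrative.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.99

def feat(s, a):
    """One-hot feature vector for a discrete (s, a) pair (hypothetical)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def fitted_q_update(phi, batch):
    """Steps 1-2: compute targets y_i, then regress Q_phi(s_i, a_i) onto them."""
    S, A, R, S_next = batch
    # Step 1: y_i = r(s_i, a_i) + gamma * max_a' Q_phi(s'_i, a')
    q_next = np.array([[phi @ feat(sp, ap) for ap in range(n_actions)] for sp in S_next])
    y = R + gamma * q_next.max(axis=1)
    # Step 2: phi <- argmin_phi 1/2 * sum_i (Q_phi(s_i, a_i) - y_i)^2
    X = np.array([feat(s, a) for s, a in zip(S, A)])
    phi_new, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi_new

# Tiny random batch just to show the call (not a real environment).
rng = np.random.default_rng(0)
batch = (rng.integers(n_states, size=20), rng.integers(n_actions, size=20),
         rng.normal(size=20), rng.integers(n_states, size=20))
phi = fitted_q_update(np.zeros(n_states * n_actions), batch)
```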
Full fitted Q-iteration algorithm:
repeat until converged:
====1: collect dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy; (dataset size $N$, collection policy)
====repeat $K$ times; (iterations $K$)
========2: set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a'_i} Q_\phi(s'_i, a'_i)$
========3: set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\lVert Q_\phi(s_i, a_i) - y_i \right\rVert^2$; (gradient steps $S$)
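Continuing the sketch above, the full loop just wraps data collection and $K$ inner updates around `fitted_q_update`. The toy random MDP and the uniform-random behaviour policy are assumptions made purely for illustration.

```python
# Full loop around fitted_q_update from the sketch above: collect a dataset
# with some behaviour policy, then repeat the inner update K times.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy p(s' | s, a)
R_table = rng.normal(size=(n_states, n_actions))                  # toy r(s, a)

def collect(N):
    """Step 1: gather N off-policy transitions with a uniform-random policy."""
    S = rng.integers(n_states, size=N)
    A = rng.integers(n_actions, size=N)
    R = R_table[S, A]
    S_next = np.array([rng.choice(n_states, p=P[s, a]) for s, a in zip(S, A)])
    return S, A, R, S_next

phi = np.zeros(n_states * n_actions)
for _ in range(10):                       # "repeat until converged" (truncated)
    batch = collect(N=50)                 # step 1: dataset of size N
    for _ in range(4):                    # repeat K times
        # steps 2-3; the exact least-squares solve plays the role of S gradient steps
        phi = fitted_q_update(phi, batch)
```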
Why off-policy
Given $s_i$ and $a_i$, the transition $s'_i \sim p(s' \mid s_i, a_i)$ is independent of the policy $\pi$, so the target in step 2 and the regression in step 3 do not care which policy collected the transitions.
Optimize what
In step 2 of the algorithm, the max trick improves the policy (exactly, in the tabular case). And if we denote the dataset distribution by $\beta$, step 3 minimizes the error
$$\mathcal{E} = \frac{1}{2} \, \mathbb{E}_{(s, a) \sim \beta} \left[ \left( Q_\phi(s, a) - \left[ r(s, a) + \gamma \max_{a'} Q_\phi(s', a') \right] \right)^2 \right].$$
If $\mathcal{E} = 0$, then $Q_\phi(s, a) = r(s, a) + \gamma \max_{a'} Q_\phi(s', a')$, which is the optimal Q-function $Q^*$, corresponding to the optimal policy $\pi^*(a_t \mid s_t) = \mathbb{1}\!\left[ a_t = \arg\max_{a_t} Q^*(s_t, a_t) \right]$. But when we leave the tabular case and use neural network function approximation, most guarantees are lost.
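Continuing the same toy setup, $\mathcal{E}$ can be estimated by averaging the squared Bellman residual over the collected batch, which stands in for the expectation under $\beta$ (a hedged sketch, not part of the notes).

```python
# Empirical estimate of the error E on the collected batch; the batch average
# stands in for the expectation under the dataset distribution beta.
def bellman_error(phi, batch):
    S, A, R, S_next = batch
    q = np.array([phi @ feat(s, a) for s, a in zip(S, A)])
    q_next = np.array([[phi @ feat(sp, ap) for ap in range(n_actions)] for sp in S_next])
    target = R + gamma * q_next.max(axis=1)
    return 0.5 * np.mean((q - target) ** 2)   # E = 1/2 E_beta[(Q_phi - target)^2]

print(bellman_error(phi, batch))   # the quantity step 3 tries to drive to zero
```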
Online Q-iteration
Set $N = 1$, $K = 1$, $S = 1$ in the full fitted Q-iteration algorithm: take a single transition and make a single gradient step per iteration.
Online Q-iteration algorithm:
repeat until converged:
====1: collect one transition $(s_i, a_i, s'_i, r_i)$ using some policy; (collection policy)
====2: set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a'_i} Q_\phi(s'_i, a'_i)$
====3: set $\phi \leftarrow \phi - \alpha \frac{\mathrm{d} Q_\phi}{\mathrm{d} \phi}(s_i, a_i) \left( Q_\phi(s_i, a_i) - y_i \right)$; (one gradient step, $S = 1$)
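Continuing the toy linear-Q setup above, a single online step looks like the sketch below; the learning rate `alpha` is an assumed hyperparameter, and for a linear $Q_\phi$ the gradient $\mathrm{d}Q_\phi/\mathrm{d}\phi(s, a)$ is just the feature vector.

```python
# One online Q-iteration step on the same toy linear-Q setup: a single
# transition, a single gradient step. alpha is an assumed learning rate.
alpha = 0.1

def online_q_step(phi, s, a, r, s_next):
    # step 2: y = r + gamma * max_a' Q_phi(s', a')
    y = r + gamma * max(phi @ feat(s_next, ap) for ap in range(n_actions))
    # step 3: phi <- phi - alpha * dQ_phi/dphi(s, a) * (Q_phi(s, a) - y)
    return phi - alpha * feat(s, a) * (phi @ feat(s, a) - y)
```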
Exploration
Just using the final greedy policy, which is deterministic, is a bad idea for collecting samples in step 1 because of the exploration-exploitation problem: actions whose Q-values are currently underestimated would never be tried. Since Q-iteration is an off-policy algorithm, other collection policies can be used, such as the two rules below (sketched in code after the list):
epsilon-greedy: $\pi(a_t \mid s_t) = \begin{cases} 1 - \epsilon & \text{if } a_t = \arg\max_{a_t} Q_\phi(s_t, a_t) \\ \epsilon / (|\mathcal{A}| - 1) & \text{otherwise} \end{cases}$
Boltzmann exploration: $\pi(a_t \mid s_t) \propto \exp\left( Q_\phi(s_t, a_t) \right)$
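Both rules can be sketched on the same toy setup; `epsilon` and the softmax `temperature` are assumed hyperparameters (the notes' Boltzmann rule is the `temperature = 1` case).

```python
# Exploration rules for picking a_t in step 1, on the same toy setup.
def epsilon_greedy(phi, s, epsilon=0.1):
    q = np.array([phi @ feat(s, a) for a in range(n_actions)])
    probs = np.full(n_actions, epsilon / (n_actions - 1))
    probs[np.argmax(q)] = 1.0 - epsilon          # greedy action gets 1 - epsilon
    return rng.choice(n_actions, p=probs)

def boltzmann(phi, s, temperature=1.0):
    q = np.array([phi @ feat(s, a) for a in range(n_actions)])
    logits = q / temperature
    probs = np.exp(logits - logits.max())        # pi(a|s) proportional to exp(Q)
    return rng.choice(n_actions, p=probs / probs.sum())
```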