Q-iteration

Fitted Q-iteration

What if we don't know the transition dynamics? Fit a Q-function instead of V: the max over actions can then be evaluated directly from the Q-values, without a model of the transitions.

Fitted Q-iteration algorithm:

repeat until converge:

====1: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')$

====2: set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2$
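As a minimal illustration of the two steps, here is a NumPy sketch assuming a tabular $Q$ stored as an array `Q[s, a]` and a batch of transitions in integer arrays (all names are illustrative; with function approximation, step 2 becomes a full regression, as in the sketch after the full algorithm below):

```python
import numpy as np

def fitted_q_step(Q, s, a, s_next, r, gamma=0.99, lr=0.5):
    # Step 1: bootstrapped targets y_i = r_i + gamma * max_a' Q(s'_i, a')
    y = r + gamma * Q[s_next].max(axis=1)
    # Step 2: move Q(s_i, a_i) toward y_i; in the tabular case the arg-min
    # regression reduces to this simple interpolation toward the targets
    Q[s, a] += lr * (y - Q[s, a])
    return Q

# Toy usage: 5 states, 2 actions, a batch of 3 random transitions.
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
s, a = rng.integers(0, 5, 3), rng.integers(0, 2, 3)
s_next, r = rng.integers(0, 5, 3), rng.random(3)
Q = fitted_q_step(Q, s, a, s_next, r)
```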

+: works even for off-policy samples (unlike actor-critic)

+: only one network, no high-variance policy gradient

-: no convergence guarantees for non-linear function approximation

Full fitted Q-iteration algorithm:

repeat until converge:

====1: collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy; (dataset size N, collection policy)

====repeat K times; (iterations K)

========2: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')$

========3: set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2$ ; (gradient steps S)
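Below is a sketch of the full loop with a small PyTorch Q-network, assuming a discrete action space; the environment interaction in step 1 is stubbed out with random transitions, and the loop sizes mirror $N$, $K$, and $S$ from the pseudocode (all dimensions and hyperparameter values are illustrative):

```python
import torch
import torch.nn as nn

n_states, n_actions = 4, 2            # toy dimensions (assumption)
gamma, N, K, S = 0.99, 64, 10, 20     # discount and loop sizes from the pseudocode

q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for _ in range(100):                                  # "repeat until converge"
    # 1: collect a dataset of N transitions with some behavior policy
    #    (random placeholder here; any off-policy data works)
    s = torch.randn(N, n_states)
    a = torch.randint(0, n_actions, (N, 1))
    s_next = torch.randn(N, n_states)
    r = torch.randn(N, 1)

    for _ in range(K):                                # repeat K times
        # 2: y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), no gradient through targets
        with torch.no_grad():
            y = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values
        # 3: S gradient steps on the squared regression error
        for _ in range(S):
            q_sa = q_net(s).gather(1, a)
            loss = 0.5 * (q_sa - y).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```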

Why off-policy

Given $s$ and $a$, the transition $s'$ and reward are independent of $\pi$, so the targets in step 2 do not depend on which policy collected the data.

What is being optimized?

In step 2 of the algorithm, the max implicitly improves the policy (exactly so in the tabular case). If we denote the dataset distribution as $\beta$, step 3 minimizes the error:

$\mathcal{E}=\frac{1}{2}\mathbb{E}_{(s,a)\sim\beta}\left[\left(Q_\phi(s,a)-\left[r(s,a)+\gamma\max_{a'}Q_\phi(s',a')\right]\right)^2\right]$

If $\mathcal{E}=0$, then $Q_\phi(s,a)=r(s,a)+\gamma\max_{a'}Q_\phi(s',a')$ everywhere $\beta$ has support, which is the optimal Q-function, corresponding to the optimal policy $\pi'$. But once we leave the tabular case and use neural-network function approximation, most guarantees are lost.
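For reference, this error can be estimated on a batch drawn from $\beta$; the snippet below is a small sketch that reuses the `q_net` and batch conventions of the PyTorch example above:

```python
import torch

def bellman_error(q_net, s, a, s_next, r, gamma=0.99):
    # E ~= 1/2 * mean_i ( Q(s_i,a_i) - [r_i + gamma * max_a' Q(s'_i, a')] )^2
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values
        residual = q_net(s).gather(1, a) - target
    return 0.5 * residual.pow(2).mean()
```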

Online Q-iteration

Set $N=1$, $K=1$, and take a single gradient step in step 3.

Online Q-iteration algorithm:

repeat until converge:

====1: take an action $a_i$ and observe one transition $(s_i,a_i,s'_i,r_i)$ using some policy; (collection policy)

====2: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')$

====3: set $\phi\leftarrow \phi-\alpha\frac{dQ_\phi}{d\phi}(s_i,a_i)\,(Q_\phi(s_i,a_i)-y_i)$ ; (one gradient step)
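A sketch of one online step, reusing the `q_net` and `optimizer` conventions from the PyTorch example above; here `s` and `s_next` are 1-D feature tensors and `a` is an integer action index (all names are illustrative):

```python
import torch

def online_q_step(q_net, optimizer, s, a, s_next, r, gamma=0.99):
    # 2: y = r + gamma * max_a' Q_phi(s', a')  (no gradient through the target)
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max()
    # 3: one gradient step on 1/2 (Q_phi(s, a) - y)^2, which gives exactly
    #    phi <- phi - alpha * dQ/dphi(s, a) * (Q_phi(s, a) - y)
    q_sa = q_net(s)[a]
    loss = 0.5 * (q_sa - y).pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```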

Exploration

Using the final greedy policy, which is deterministic, to collect samples in step 1 is a bad idea because of the exploration-exploitation problem. Since Q-iteration is an off-policy algorithm, other collection policies can be used instead, such as:

epsilon-greedy:

$\pi(a_t|s_t)= \begin{cases} 1-\epsilon &\text{if }a_t=\arg\max_{a}Q_\phi(s_t,a)\\ \epsilon/(|\mathcal{A}|-1)&\text{otherwise} \end{cases}$

Boltzmann exploration:

$\pi(a_t|s_t)\propto \exp(Q_\phi(s_t,a_t))$
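Both rules are easy to implement given the vector of Q-values for the current state; this is a minimal sketch (the temperature in the Boltzmann rule is a common generalization, not part of the formula above):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    greedy = int(np.argmax(q_values))
    # with prob. 1 - eps take the greedy action ...
    if rng.random() < 1 - epsilon:
        return greedy
    # ... otherwise a uniform choice among the remaining |A| - 1 actions
    others = [a for a in range(len(q_values)) if a != greedy]
    return int(rng.choice(others))

def boltzmann(q_values, temperature=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # pi(a|s) proportional to exp(Q(s, a) / temperature), stabilized softmax
    logits = (np.asarray(q_values) - np.max(q_values)) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```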
