Q-iteration

Fitted Q-iteration

What if we don't know the transition dynamics? Fit a Q-function instead of V: the max over actions can then be evaluated directly from the Q-values, without a model of the transitions.

Fitted Q-iteration algorithm:

repeat until converge:

====1: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')$

====2: set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2$
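As a minimal illustration of the two steps, here is a NumPy sketch assuming a tabular $Q$ stored as an array `Q[s, a]` and a batch of transitions in integer arrays (all names are illustrative; with function approximation, step 2 becomes a full regression, as in the sketch after the full algorithm below):

```python
import numpy as np

def fitted_q_step(Q, s, a, s_next, r, gamma=0.99, lr=0.5):
    # Step 1: bootstrapped targets y_i = r_i + gamma * max_a' Q(s'_i, a')
    y = r + gamma * Q[s_next].max(axis=1)
    # Step 2: move Q(s_i, a_i) toward y_i; in the tabular case the arg-min
    # regression reduces to this simple interpolation toward the targets
    Q[s, a] += lr * (y - Q[s, a])
    return Q

# Toy usage: 5 states, 2 actions, a batch of 3 random transitions.
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
s, a = rng.integers(0, 5, 3), rng.integers(0, 2, 3)
s_next, r = rng.integers(0, 5, 3), rng.random(3)
Q = fitted_q_step(Q, s, a, s_next, r)
```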

+: works even for off-policy samples (unlike actor-critic)

+: only one network, no high-variance policy gradient

-: no convergence guarantees for non-linear function approximation

Full fitted Q-iteration algorithm:

repeat until converge:

====1: collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy; (dataset size N, collection policy)

====repeat K times; (iterations K)

========2: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')$

========3: set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2$ ; (gradient steps S)
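Below is a sketch of the full loop with a small PyTorch Q-network, assuming a discrete action space; the environment interaction in step 1 is stubbed out with random transitions, and the loop sizes mirror $N$, $K$, and $S$ from the pseudocode (all dimensions and hyperparameter values are illustrative):

```python
import torch
import torch.nn as nn

n_states, n_actions = 4, 2            # toy dimensions (assumption)
gamma, N, K, S = 0.99, 64, 10, 20     # discount and loop sizes from the pseudocode

q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for _ in range(100):                                  # "repeat until converge"
    # 1: collect a dataset of N transitions with some behavior policy
    #    (random placeholder here; any off-policy data works)
    s = torch.randn(N, n_states)
    a = torch.randint(0, n_actions, (N, 1))
    s_next = torch.randn(N, n_states)
    r = torch.randn(N, 1)

    for _ in range(K):                                # repeat K times
        # 2: y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), no gradient through targets
        with torch.no_grad():
            y = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values
        # 3: S gradient steps on the squared regression error
        for _ in range(S):
            q_sa = q_net(s).gather(1, a)
            loss = 0.5 * (q_sa - y).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```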

Why off-policy

Given $s$ and $a$, the transition $s'$ and reward are independent of $\pi$, so the targets in step 2 do not depend on which policy collected the data.

What is being optimized?

In step 2 of the algorithm, the max implicitly improves the policy (exactly so in the tabular case). If we denote the dataset distribution as $\beta$, step 3 minimizes the error:

$\mathcal{E}=\frac{1}{2}\mathbb{E}_{(s,a)\sim\beta}\left[\left(Q_\phi(s,a)-\left[r(s,a)+\gamma\max_{a'}Q_\phi(s',a')\right]\right)^2\right]$

If $\mathcal{E}=0$, then $Q_\phi(s,a)=r(s,a)+\gamma\max_{a'}Q_\phi(s',a')$ everywhere $\beta$ has support, which is the optimal Q-function, corresponding to the optimal policy $\pi'$. But once we leave the tabular case and use neural-network function approximation, most guarantees are lost.
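For reference, this error can be estimated on a batch drawn from $\beta$; the snippet below is a small sketch that reuses the `q_net` and batch conventions of the PyTorch example above:

```python
import torch

def bellman_error(q_net, s, a, s_next, r, gamma=0.99):
    # E ~= 1/2 * mean_i ( Q(s_i,a_i) - [r_i + gamma * max_a' Q(s'_i, a')] )^2
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values
        residual = q_net(s).gather(1, a) - target
    return 0.5 * residual.pow(2).mean()
```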

Online Q-iteration

Set $N=1$, $K=1$, and take a single gradient step in step 3.

Online Q-iteration algorithm:

repeat until converge:

====1: take an action $a_i$ and observe one transition $(s_i,a_i,s'_i,r_i)$ using some policy; (collection policy)

====2: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')$

====3: set $\phi\leftarrow \phi-\alpha\frac{dQ_\phi}{d\phi}(s_i,a_i)\,(Q_\phi(s_i,a_i)-y_i)$ ; (one gradient step)
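A sketch of one online step, reusing the `q_net` and `optimizer` conventions from the PyTorch example above; here `s` and `s_next` are 1-D feature tensors and `a` is an integer action index (all names are illustrative):

```python
import torch

def online_q_step(q_net, optimizer, s, a, s_next, r, gamma=0.99):
    # 2: y = r + gamma * max_a' Q_phi(s', a')  (no gradient through the target)
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max()
    # 3: one gradient step on 1/2 (Q_phi(s, a) - y)^2, which gives exactly
    #    phi <- phi - alpha * dQ/dphi(s, a) * (Q_phi(s, a) - y)
    q_sa = q_net(s)[a]
    loss = 0.5 * (q_sa - y).pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```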

Exploration

Using the final greedy policy, which is deterministic, to collect samples in step 1 is a bad idea because of the exploration-exploitation problem. Since Q-iteration is an off-policy algorithm, other collection policies can be used instead, such as:

epsilon-greedy:

$\pi(a_t|s_t)= \begin{cases} 1-\epsilon &\text{if }a_t=\arg\max_{a}Q_\phi(s_t,a)\\ \epsilon/(|\mathcal{A}|-1)&\text{otherwise} \end{cases}$

Boltzmann exploration:

$\pi(a_t|s_t)\propto \exp(Q_\phi(s_t,a_t))$
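Both rules are easy to implement given the vector of Q-values for the current state; this is a minimal sketch (the temperature in the Boltzmann rule is a common generalization, not part of the formula above):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    greedy = int(np.argmax(q_values))
    # with prob. 1 - eps take the greedy action ...
    if rng.random() < 1 - epsilon:
        return greedy
    # ... otherwise a uniform choice among the remaining |A| - 1 actions
    others = [a for a in range(len(q_values)) if a != greedy]
    return int(rng.choice(others))

def boltzmann(q_values, temperature=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # pi(a|s) proportional to exp(Q(s, a) / temperature), stabilized softmax
    logits = (np.asarray(q_values) - np.max(q_values)) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```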
