Q-iteration

Fitted Q-iteration

What if we don't know the transition dynamics? Work with Q-values and fit a Q-function instead of V.

Fitted Q-iteration algorithm:

repeat until converge:

====1: set y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')

====2: set \phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2
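A minimal sketch of these two steps on a single batch of transitions, assuming a small PyTorch Q-network over discrete actions (the state dimension, network size, optimizer, and value of γ below are illustrative assumptions, not part of the notes):

```python
import torch
import torch.nn as nn

gamma = 0.99        # discount factor (assumed value)
num_actions = 4     # |A|, assuming a discrete action space
state_dim = 8       # assumed state dimension

# Q_phi: maps a state to a vector of Q-values, one entry per action.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_update(s, a, r, s_next):
    """Steps 1-2 on a batch of transitions (s_i, a_i, s'_i, r_i)."""
    # Step 1: y_i = r(s_i, a_i) + gamma * max_a' Q_phi(s'_i, a'); targets are constants.
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values
    # Step 2: minimize 1/2 sum_i ||Q_phi(s_i, a_i) - y_i||^2 (mean used here only for scaling).
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * ((q_sa - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```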

+: works even for off-policy samples (unlike actor-critic)

+: only one network, no high-variance policy gradient

-: no convergence guarantees for non-linear function approximation

Full fitted Q-iteration algorithm:

repeat until converge:

====1: collect dataset \{(s_i,a_i,s'_i,r_i)\} using some policy; (dataset size N, collection policy)

====repeat K times; (iterations K)

========2: set y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')

========3: set \phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2 ; (gradient steps S)
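As a sketch, the outer structure might look like the loop below, reusing `q_net`, `gamma`, and `optimizer` from the sketch above. `collect_transitions` is a hypothetical helper standing in for whatever collection policy is used, and the values of N, K, and S are illustrative:

```python
N = 10_000   # dataset size per collection round (assumed)
K = 10       # inner iterations: recompute targets, then refit (assumed)
S = 100      # gradient steps per fit (assumed)

for _ in range(50):  # stand-in for "repeat until converge"
    # 1: collect dataset {(s_i, a_i, s'_i, r_i)} using some policy.
    s, a, r, s_next = collect_transitions(N)  # hypothetical helper
    for _ in range(K):
        # 2: compute targets once, then hold them fixed while fitting.
        with torch.no_grad():
            y = r + gamma * q_net(s_next).max(dim=1).values
        # 3: S gradient steps on the regression to the fixed targets.
        for _ in range(S):
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = 0.5 * ((q_sa - y) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```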

Why off-policy

Given s and a, the transition to s' is independent of \pi, so transitions collected by any policy can be reused.

Hence Q-iteration is off-policy.

Optimize what

In line 2 of the algorithm, taking the max improves the policy (exactly, in the tabular case). If we denote the dataset distribution as β, the error being minimized is:

\mathcal{E}=\frac{1}{2}\mathbb{E}_{(s,a)\sim\beta}\left[\left(Q_\phi(s,a)-\left[r(s,a)+\gamma\max_{a'}Q_\phi(s',a')\right]\right)^2\right]

If \mathcal{E}=0, then Q_\phi(s,a)=r(s,a)+\gamma\max_{a'}Q_\phi(s',a') everywhere on the support of β, which is the optimal Q-function, corresponding to the optimal policy \pi'. But once we leave the tabular case and use neural-network function approximation, most of these guarantees are lost.
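For concreteness, this error can be estimated on the dataset as a simple diagnostic; a sketch reusing `q_net` and `gamma` from above (it says nothing about convergence once a neural network is involved):

```python
@torch.no_grad()
def bellman_error(s, a, r, s_next):
    """Monte Carlo estimate of E over samples (s, a, r, s') drawn from beta."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    target = r + gamma * q_net(s_next).max(dim=1).values
    return 0.5 * ((q_sa - target) ** 2).mean().item()
```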

Online Q-iteration

Set N=1, K=1.

Online Q-iteration algorithm:

repeat until converge:

====1: collect one transition (s_i,a_i,s'_i,r_i) using some policy; (collection policy)

====2: set y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a')

====3: set \phi\leftarrow \phi-\alpha\frac{dQ_\phi}{d\phi}(s_i,a_i)\,(Q_\phi(s_i,a_i)-y_i) ; (gradient steps S)
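A sketch of this single-transition update in the same PyTorch setup as above (`s_i` a 1-D state tensor, `a_i` an integer action, `alpha` an assumed learning rate); the target y_i is held fixed when differentiating, so the gradient is exactly dQ_φ/dφ(s_i,a_i)·(Q_φ(s_i,a_i)−y_i):

```python
alpha = 1e-3   # learning rate (assumed value)

def online_q_step(s_i, a_i, r_i, s_next_i):
    """One online Q-iteration step on a single transition (N = 1, K = 1)."""
    # Step 2: y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), treated as a constant.
    with torch.no_grad():
        y_i = r_i + gamma * q_net(s_next_i.unsqueeze(0)).max(dim=1).values.squeeze(0)
    # Step 3: phi <- phi - alpha * dQ_phi/dphi(s_i, a_i) * (Q_phi(s_i, a_i) - y_i).
    q_sa = q_net(s_i.unsqueeze(0))[0, a_i]
    loss = 0.5 * (q_sa - y_i) ** 2          # gradient of this is exactly the update above
    q_net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in q_net.parameters():
            p -= alpha * p.grad             # plain SGD step matching the formula
```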

Exploration

Using only the final (argmax) policy, which is deterministic, is a bad idea for collecting samples in step 1 because of the exploration-exploitation problem. Since Q-iteration is an off-policy algorithm, other collection policies can be used instead, such as:

epsilon-greedy:

\pi(a_t|s_t)= \begin{cases} 1-\epsilon &\text{if }a_t=\arg\max_{a_t}Q_\phi(s_t,a_t)\\ \epsilon/(|\mathcal{A}|-1)&\text{otherwise} \end{cases}

Boltzmann exploration:

\pi(a_t|s_t)\propto \exp(Q_\phi(s_t,a_t))
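Both exploration rules are easy to implement on top of the Q-network; a minimal sketch, reusing `q_net` and `num_actions` from above and assuming a value for ε:

```python
import random

epsilon = 0.1   # exploration rate (assumed value)

@torch.no_grad()
def epsilon_greedy_action(s_t):
    """With prob. 1 - epsilon take argmax_a Q(s_t, a); otherwise a uniformly random other action."""
    q_values = q_net(s_t.unsqueeze(0)).squeeze(0)
    greedy = int(q_values.argmax())
    if random.random() < 1 - epsilon:
        return greedy
    others = [a for a in range(num_actions) if a != greedy]
    return random.choice(others)

@torch.no_grad()
def boltzmann_action(s_t):
    """Sample a_t with probability proportional to exp(Q_phi(s_t, a_t))."""
    q_values = q_net(s_t.unsqueeze(0)).squeeze(0)
    probs = torch.softmax(q_values, dim=0)
    return int(torch.multinomial(probs, num_samples=1))
```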
