Learning theory

Tabular value function

Does tabular value iteration converge? And if so, what does it converge to? The value iteration update is

$V(s)\leftarrow \max_a\left[ r(s,a)+\gamma \mathbb{E}[V(s')] \right]$

A few pieces of notation:

$r_a$: stacked vector of rewards at all states for action $a$

$\mathcal{T}_a$: matrix of transitions for action $a$ such that $\mathcal{T}_{a,i,j}=p(s'=i\mid s=j,a)$

Then define an operator $\mathcal{B}$: $\mathcal{B}V=\max_a(r_a+\gamma\mathcal{T}_aV)$.
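As a concrete illustration, here is a minimal NumPy sketch of tabular value iteration built on this operator. Everything in it (the MDP sizes, the random rewards and transitions) is made up for the example, and the transition tensor `P[a, s, s']` uses the transpose of the note's $\mathcal{T}_a$ indexing purely for convenience.

```python
# Minimal sketch of tabular value iteration on a made-up toy MDP.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9

# r[a, s] = r(s, a); P[a, s, s'] = p(s' | s, a)
# (P is the transpose of the note's T_a in its last two indices)
r = rng.random((A, S))
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)          # each row sums to 1 over s'

def bellman_backup(V):
    """(BV)(s) = max_a [ r(s, a) + gamma * E_{s'}[V(s')] ]."""
    return np.max(r + gamma * P @ V, axis=0)   # P @ V has shape (A, S)

V = np.zeros(S)
for _ in range(1000):
    V_new = bellman_backup(V)
    if np.max(np.abs(V_new - V)) < 1e-10:      # infinity-norm stopping criterion
        break
    V = V_new

print("V* ≈", V)
```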

If $V^\star$ is a fixed point of $\mathcal{B}$, i.e. $\mathcal{B}V^\star=V^\star$, then $V^\star(s)=\max_a\left[r(s,a)+\gamma\mathbb{E}[V^\star(s')]\right]$

The good thing is that, in the tabular case, $V^\star$ always exists, is always unique, and always corresponds to the optimal policy. We can prove that value iteration reaches $V^\star$ because $\mathcal{B}$ is a contraction: for any $V$ and $\bar{V}$, we have $\|\mathcal{B}V-\mathcal{B}\bar{V}\|_{\infty}\le \gamma\|V-\bar{V}\|_\infty$, i.e. the gap between $V$ and $\bar{V}$ shrinks by a factor of $\gamma$ (in the $\infty$-norm) with each application of $\mathcal{B}$.

Simply choosing $V^\star$ as $\bar{V}$ gives the proof: $\|\mathcal{B}V-V^\star\|_\infty\le\gamma\|V-V^\star\|_\infty$, so repeated application of $\mathcal{B}$ drives $V$ to $V^\star$.
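The contraction inequality is easy to sanity-check numerically. The sketch below draws random value-vector pairs on another made-up MDP (same `r[a, s]`, `P[a, s, s']` conventions as above) and verifies $\|\mathcal{B}V-\mathcal{B}\bar{V}\|_\infty \le \gamma\|V-\bar{V}\|_\infty$.

```python
# Numerical check of the gamma-contraction property of B in the infinity-norm.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
r = rng.random((A, S))
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)

backup = lambda V: np.max(r + gamma * P @ V, axis=0)   # (BV)(s)

for _ in range(100):
    V, V_bar = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(backup(V) - backup(V_bar)))    # ||BV - BV_bar||_inf
    rhs = gamma * np.max(np.abs(V - V_bar))            # gamma * ||V - V_bar||_inf
    assert lhs <= rhs + 1e-12
print("contraction inequality held on all random pairs")
```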

Non-tabular value function

In the tabular case, the update is $V\leftarrow \mathcal{B}V$. In the non-tabular case, we need to define a new operator $\Pi$: $\Pi V=\arg\min_{V'\in \Omega}\frac{1}{2}\sum_s\|V'(s)-V(s)\|^2$, where $\Pi$ is a projection onto the function class $\Omega$ in the $\ell_2$ sense.

So the fitted value iteration algorithm is $V\leftarrow \Pi\mathcal{B}V$

$\mathcal{B}$ is a contraction w.r.t. the $\infty$-norm ("max" norm)

$\Pi$ is a contraction w.r.t. the $\ell_2$-norm (Euclidean distance)

but $\Pi\mathcal{B}$ is not a contraction with respect to any norm.

So value iteration converges in the tabular case, but fitted value iteration does not converge in general.
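To make the non-convergence concrete, here is a sketch of $V\leftarrow\Pi\mathcal{B}V$ on a simple two-state example in the spirit of the classic counterexamples (the details here are illustrative, not from this note): zero rewards, a single action under which both states transition to the second state, and a linear class $V_w(s)=w\,\phi(s)$ with features $\phi=(1,2)$. Tabular value iteration would give $V^\star=0$, yet the projected iterate $w$ grows by a factor $\tfrac{6\gamma}{5}>1$ per step whenever $\gamma>5/6$.

```python
# Fitted value iteration V <- Pi B V on an illustrative two-state example:
# zero rewards, both states transition deterministically to state 1 (0-indexed),
# linear function class V_w(s) = w * phi(s) with phi = [1, 2].
import numpy as np

gamma = 0.9
phi = np.array([1.0, 2.0])         # features of the two states
P = np.array([[0.0, 1.0],          # p(s' | s): both states go to state 1
              [0.0, 1.0]])
r = np.zeros(2)

w = 1.0                            # initial weight
for i in range(20):
    V = w * phi                    # current value estimate
    target = r + gamma * P @ V     # Bellman backup B V (exact, no sampling)
    # Projection Pi: least-squares fit of w * phi to the backed-up values
    w = np.dot(phi, target) / np.dot(phi, phi)
    print(f"iter {i}: w = {w:.3f}")
# The true optimal value is V* = 0, yet |w| grows by a factor of 1.2 * gamma each step.
```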

Fitted Q-iteration

Same as above: the Bellman backup for Q-functions, $\mathcal{B}Q=r+\gamma\mathcal{T}\max_{a'}Q$, is an $\infty$-norm contraction and the projection $\Pi$ is an $\ell_2$ contraction, but the composition $\Pi\mathcal{B}$ is not a contraction, so fitted Q-iteration is not guaranteed to converge.
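A sketch of the same $\Pi\mathcal{B}$ structure for Q-functions, assuming a batch of transitions and a linear function class on random state-action features (all names, sizes, and the data here are made up for illustration):

```python
# Fitted Q-iteration with a linear class Q_w(s, a) = Phi[s, a] . w
# on a made-up batch of transitions (s, a, r, s').
import numpy as np

rng = np.random.default_rng(2)
S, A, d, N, gamma = 6, 3, 4, 500, 0.95

Phi = rng.normal(size=(S, A, d))           # feature vector per (s, a) pair
w = np.zeros(d)

s   = rng.integers(0, S, size=N)           # sampled states
a   = rng.integers(0, A, size=N)           # sampled actions
rew = rng.random(N)                        # sampled rewards
s2  = rng.integers(0, S, size=N)           # sampled next states

def q_values(w, states):
    """Q_w(s, a) for every action at the given states; shape (len(states), A)."""
    return Phi[states] @ w                 # (N, A, d) @ (d,) -> (N, A)

for it in range(5):
    # Bootstrapped regression targets: y = r + gamma * max_a' Q_w(s', a')
    y = rew + gamma * q_values(w, s2).max(axis=1)
    # "Projection" step: least-squares fit of Q_w(s, a) to the targets y
    X = Phi[s, a]                          # (N, d) features of the visited (s, a)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

print("fitted weights:", w)
```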

Fitted bootstrapped policy evaluation doesn't converge either, for the same reason: the bootstrapped backup is a contraction, but its composition with the projection is not.
