# Learning theory

## Tabular value function

Does tabular value iteration converge? And if so, to what? The value iteration update equation is

$$
V(s)\leftarrow \max_a\left[r(s,a)+\gamma\,\mathbb{E}[V(s')]\right]
$$

First, some notation:

$$r\_a$$: stacked vector of rewards at all states for action $$a$$

$$\mathcal{T}\_a$$: matrix of transitions for action $$a$$ such that $$\mathcal{T}\_{a,i,j}=p(s'=j|s=i,a)$$ (each row of $$\mathcal{T}\_a$$ is a distribution over next states, so $$\mathcal{T}\_aV$$ computes $$\mathbb{E}\[V(s')]$$)

Then define an operator $$\mathcal{B}: \mathcal{B}V=\max\_a(r\_a+\gamma\mathcal{T}\_aV)$$.

If $$V^\star$$ is a fixed point of $$\mathcal{B}$$, i.e. $$\mathcal{B}V^\star=V^\star$$, then $$V^\star(s)=\max\_a\[r(s,a)+\gamma\mathbb{E}\[V^\star(s')]]$$, which is exactly the Bellman optimality equation.

The good news is that in the tabular case, $$V^\star$$ always exists, is always unique, and always corresponds to the optimal policy. We can prove that value iteration reaches $$V^\star$$ because $$\mathcal{B}$$ is a contraction: for any $$V$$ and $$\bar{V}$$, we have $$|\mathcal{B}V-\mathcal{B}\bar{V}|\_{\infty}\le \gamma|V-\bar{V}|\_\infty$$, i.e. the gap between $$V$$ and $$\bar{V}$$ shrinks by at least a factor of $$\gamma$$ in the $$\infty$$-norm with each application of $$\mathcal{B}$$.

Choosing $$\bar{V}=V^\star$$, which is a fixed point of $$\mathcal{B}$$, gives $$|\mathcal{B}V-V^\star|\_\infty\le\gamma|V-V^\star|\_\infty$$: each iteration brings $$V$$ closer to $$V^\star$$, so value iteration converges.
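Both facts are easy to check numerically. A minimal sketch on a small randomly generated MDP (the MDP itself is made up for illustration), verifying the contraction property and the convergence of $$V\leftarrow\mathcal{B}V$$ to a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# T[a, i, j] = p(s' = j | s = i, a): each row of T[a] is a distribution over s'
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states))  # r[a, s]: reward for action a in state s

def bellman(V):
    # (BV)(s) = max_a [ r(s, a) + gamma * E[V(s') | s, a] ]
    return np.max(r + gamma * np.einsum('aij,j->ai', T, V), axis=0)

# Contraction check: B shrinks the max-norm gap by at least a factor of gamma.
V1, V2 = rng.random(n_states), rng.random(n_states)
gap_before = np.max(np.abs(V1 - V2))
gap_after = np.max(np.abs(bellman(V1) - bellman(V2)))
assert gap_after <= gamma * gap_before + 1e-12

# Iterate V <- BV: converges to the unique fixed point V* (Bellman residual -> 0).
V = np.zeros(n_states)
for _ in range(500):
    V = bellman(V)
print(np.max(np.abs(bellman(V) - V)))  # residual is numerically zero
```

Starting the iteration from any other $$V$$ yields the same fixed point, consistent with uniqueness.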

## Non-tabular value function

For the tabular case, the update is $$V\leftarrow \mathcal{B}V$$. For the non-tabular case, the value function is restricted to some hypothesis class $$\Omega$$ (e.g. neural networks), so we need a new operator $$\Pi:\Pi V=\arg\min\_{V'\in \Omega}\frac{1}{2}\sum\_s|V'(s)-V(s)|^2$$, where $$\Pi$$ is a projection onto $$\Omega$$ in terms of the $$\ell\_2$$-norm.

So the fitted value iteration algorithm is $$V\leftarrow \Pi\mathcal{B}V$$.

$$\mathcal{B}$$ is a contraction w\.r.t. the $$\infty$$-norm ("max" norm).

$$\Pi$$ is a contraction w\.r.t. the $$\ell\_2$$-norm (Euclidean distance).

But $$\Pi\mathcal{B}$$ is not a contraction with respect to any norm: the two operators contract under different norms, and composing them gives no guarantee.

So value iteration converges in the tabular case, but fitted value iteration does not converge in general.
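This failure is not just theoretical. A classic two-state counterexample (due to Tsitsiklis and Van Roy) has zero rewards, transitions $$s\_1\to s\_2$$ and $$s\_2\to s\_2$$, and a linear value function $$V\_w(s)=w\,\phi(s)$$ with features $$\phi(s\_1)=1$$, $$\phi(s\_2)=2$$. A minimal sketch of fitted value iteration $$V\leftarrow\Pi\mathcal{B}V$$ on this MDP, using exact expectations and least squares as the $$\ell\_2$$ projection:

```python
import numpy as np

gamma = 0.95
phi = np.array([[1.0], [2.0]])        # features: phi(s1)=1, phi(s2)=2
P = np.array([[0.0, 1.0],             # s1 -> s2 deterministically
              [0.0, 1.0]])            # s2 -> s2 (self-loop)
r = np.zeros(2)                       # all rewards are zero, so V* = 0

w = np.array([1.0])                   # initial weight
for _ in range(20):
    V = phi @ w                       # current approximate values
    target = r + gamma * P @ V        # Bellman backup BV (exact expectation)
    # L2 projection onto the span of phi = least-squares fit of the targets
    w, *_ = np.linalg.lstsq(phi, target, rcond=None)

print(w)  # |w| has grown every step, even though the true V* is representable
```

Working through the least-squares step by hand, each projected backup multiplies $$w$$ by $$\frac{6}{5}\gamma$$, which exceeds 1 whenever $$\gamma>5/6$$: the iterates diverge even though $$V^\star=0$$ lies exactly in the function class.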

## Fitted Q-iteration

The same argument applies: the Bellman backup for $$Q$$ is a contraction in the $$\infty$$-norm, but composing it with an $$\ell\_2$$ projection onto the function class breaks the contraction, so fitted Q-iteration does not converge in general.
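For concreteness, the structure of fitted Q-iteration is sketched below. The setup is assumed for illustration: a linear regressor over one-hot state-action features and exact expectation targets (in practice the targets come from sampled transitions and the regressor is a neural network):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 0.9
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)     # T[a, s, s'] = p(s' | s, a)
r = rng.random((n_states, n_actions)) # r[s, a]

# One-hot features over (s, a) pairs; with these, the projection is lossless.
Phi = np.eye(n_states * n_actions)

w = np.zeros(n_states * n_actions)
for _ in range(300):
    Q = (Phi @ w).reshape(n_states, n_actions)
    # regression targets: y(s, a) = r(s, a) + gamma * E[max_a' Q(s', a')]
    y = r + gamma * np.einsum('asj,j->sa', T, Q.max(axis=1))
    # "fit" step: least-squares regression of the targets onto the features
    w, *_ = np.linalg.lstsq(Phi, y.ravel(), rcond=None)
```

With one-hot features the projection is exact, so this reduces to tabular Q-iteration and converges; with a lossy function class the same loop can diverge, for the reasons above.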

For the same reason, fitted bootstrapped policy evaluation does not converge in general either.
