Learning theory

Tabular value function

Does tabular value iteration converge? And if so, to what? The value iteration update is

$$V(s)\leftarrow \max_a\big[\, r(s,a)+\gamma\,\mathbb{E}[V(s')]\,\big]$$

First, some notation:

• $r_a$: the vector of rewards $r(s,a)$ stacked over all states, for a fixed action $a$

• $\mathcal{T}_a$: the transition matrix for action $a$, with entries $\mathcal{T}_{a,i,j}=p(s'=i\mid s=j,a)$

Now define the Bellman backup operator $\mathcal{B}$ by $\mathcal{B}V=\max_a\left(r_a+\gamma\mathcal{T}_a V\right)$.

If $V^\star$ is a fixed point of $\mathcal{B}$, i.e. $\mathcal{B}V^\star=V^\star$, then $V^\star(s)=\max_a\left[r(s,a)+\gamma\,\mathbb{E}[V^\star(s')]\right]$, which is exactly the Bellman optimality equation.

The good news is that in the tabular case $V^\star$ always exists, is always unique, and always corresponds to the optimal policy. Value iteration reaches $V^\star$ because $\mathcal{B}$ is a contraction: for any $V$ and $\bar{V}$, $\|\mathcal{B}V-\mathcal{B}\bar{V}\|_{\infty}\le \gamma\|V-\bar{V}\|_\infty$, i.e. one backup shrinks the gap between $V$ and $\bar{V}$ by at least a factor of $\gamma$ in the $\infty$-norm.

Choosing $\bar{V}=V^\star$ completes the proof: $\|\mathcal{B}V-V^\star\|_\infty=\|\mathcal{B}V-\mathcal{B}V^\star\|_\infty\le\gamma\|V-V^\star\|_\infty$, so every iteration brings $V$ closer to $V^\star$ and tabular value iteration converges.
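As a concrete illustration, below is a minimal NumPy sketch of tabular value iteration in the vectorized form $\mathcal{B}V=\max_a(r_a+\gamma\mathcal{T}_a V)$, run on a small randomly generated MDP (the MDP itself is made up purely for illustration). The printed $\infty$-norm gaps between successive iterates shrink geometrically, as the contraction argument predicts.

```python
import numpy as np

# A small random MDP, made up purely for illustration:
# T[a, i, j] = p(s' = i | s = j, a), r[a, j] = r(s = j, a).
rng = np.random.default_rng(0)
S, A, gamma = 10, 3, 0.9
T = rng.random((A, S, S))
T /= T.sum(axis=1, keepdims=True)      # make each T[a, :, j] a distribution over next states
r = rng.random((A, S))

def bellman_backup(V):
    """B V = max_a (r_a + gamma * T_a V), computed for all states at once."""
    expected_next = np.einsum('aij,i->aj', T, V)   # E[V(s') | s = j, a]
    return np.max(r + gamma * expected_next, axis=0)

V = np.zeros(S)
for k in range(1, 51):
    V_new = bellman_backup(V)
    gap = np.max(np.abs(V_new - V))    # infinity-norm distance between successive iterates
    V = V_new
    if k % 10 == 0:
        print(f"iteration {k:2d}: gap = {gap:.2e}")
```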

Non-tabular value function

In the tabular case, the update is simply $V\leftarrow \mathcal{B}V$. In the non-tabular case, the value function is restricted to some hypothesis class $\Omega$ (for example, the set of functions a given neural network can represent), so we need a second operator $\Pi$: $\Pi V=\arg\min_{V'\in \Omega}\frac{1}{2}\sum_s\|V'(s)-V(s)\|^2$, i.e. $\Pi$ is the projection onto $\Omega$ in the $\ell_2$ sense.

So the fitted value iteration algorithm is $V\leftarrow \Pi\mathcal{B}V$.

• $\mathcal{B}$ is a contraction w.r.t. the $\infty$-norm (the "max" norm),

• $\Pi$ is a contraction w.r.t. the $\ell_2$-norm (Euclidean distance),

• but the composition $\Pi\mathcal{B}$ is not a contraction in any norm.

So value iteration converges in the tabular case, but fitted value iteration is not guaranteed to converge.
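To make the two operators concrete, here is a sketch of fitted value iteration $V\leftarrow\Pi\mathcal{B}V$ on the same kind of toy MDP, with $\Pi$ implemented as a least-squares fit onto a fixed, randomly chosen linear feature matrix (all of it illustrative, not from the text). The loop has the same shape as the tabular one above, but with the projection step in the middle there is no contraction argument, so the gap between iterates is not guaranteed to shrink.

```python
import numpy as np

# Same kind of toy MDP as above, plus a random linear feature matrix Phi (illustrative only).
rng = np.random.default_rng(1)
S, A, d, gamma = 10, 3, 4, 0.9
T = rng.random((A, S, S))
T /= T.sum(axis=1, keepdims=True)
r = rng.random((A, S))
Phi = rng.random((S, d))               # Omega = {Phi @ w : w in R^d}, the representable value functions

def bellman_backup(V):
    return np.max(r + gamma * np.einsum('aij,i->aj', T, V), axis=0)

def project(V_target):
    """Pi V: l2 projection onto Omega, i.e. a least-squares fit of Phi @ w to V_target."""
    w, *_ = np.linalg.lstsq(Phi, V_target, rcond=None)
    return Phi @ w

V = np.zeros(S)
for k in range(1, 51):
    V_new = project(bellman_backup(V))     # fitted value iteration: V <- Pi B V
    gap = np.max(np.abs(V_new - V))        # no guarantee this shrinks geometrically
    V = V_new
    if k % 10 == 0:
        print(f"iteration {k:2d}: gap = {gap:.2e}")
```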

Fitted Q-iteration

The same argument applies to fitted Q-iteration: the Bellman backup is a contraction in the $\infty$-norm and the regression (projection) step is a contraction in the $\ell_2$-norm, but their composition is not a contraction, so fitted Q-iteration is not guaranteed to converge. Fitted bootstrapped policy evaluation is likewise not guaranteed to converge.
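For completeness, here is a minimal sketch of the fitted Q-iteration loop in the same style, again with made-up linear features (here over state-action pairs): the regression target is $y=r(s,a)+\gamma\,\mathbb{E}[\max_{a'}Q(s',a')]$, and the fit step is exactly the projection that breaks the contraction argument.

```python
import numpy as np

# Toy MDP and random state-action features, purely for illustration.
rng = np.random.default_rng(2)
S, A, d, gamma = 10, 3, 4, 0.9
T = rng.random((A, S, S))
T /= T.sum(axis=1, keepdims=True)
r = rng.random((A, S))
Phi = rng.random((A, S, d))            # one feature vector per (action, state) pair

def fit(Q_target):
    """Projection step: least-squares fit of the linear model to the target Q-values."""
    X = Phi.reshape(A * S, d)
    w, *_ = np.linalg.lstsq(X, Q_target.reshape(A * S), rcond=None)
    return (X @ w).reshape(A, S)

Q = np.zeros((A, S))
for k in range(1, 51):
    V = Q.max(axis=0)                                    # max_a' Q(s', a')
    target = r + gamma * np.einsum('aij,i->aj', T, V)    # y = r(s,a) + gamma E[max_a' Q(s',a')]
    Q = fit(target)                                      # Q <- Pi B Q
```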
