Learning theory
Does tabular value iteration converge? And if so, to what? The value iteration update is

$$V(s) \leftarrow \max_a \Big[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V(s') \big] \Big]$$
And some notation:

- $r_a$: stacked vector of rewards at all states for action $a$
- $T_a$: matrix of transitions for action $a$ such that $T_{a, i, j} = p(s' = j \mid s = i, a)$

Then define an operator $\mathcal{B}$: $\mathcal{B}V = \max_a \big( r_a + \gamma T_a V \big)$.
If $V^*$ is a fixed point of $\mathcal{B}$, i.e. $\mathcal{B}V^* = V^*$, then $V^*(s) = \max_a \big[ r(s, a) + \gamma \, \mathbb{E}[V^*(s')] \big]$ for every state, which is exactly the optimal value function.

The good news is that in the tabular case $V^*$ always exists, is always unique, and always corresponds to the optimal policy. We can prove that value iteration reaches $V^*$ because $\mathcal{B}$ is a contraction: for any $V$ and $\bar{V}$, we have $\|\mathcal{B}V - \mathcal{B}\bar{V}\|_\infty \le \gamma \|V - \bar{V}\|_\infty$, i.e. the gap between $V$ and $\bar{V}$ always shrinks by a factor of $\gamma$ with respect to the $\infty$-norm.

Simply choosing $\bar{V} = V^*$ gives $\|\mathcal{B}V - V^*\|_\infty \le \gamma \|V - V^*\|_\infty$ (since $\mathcal{B}V^* = V^*$), so each application of $\mathcal{B}$ brings $V$ closer to $V^*$, which completes the proof.
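To make this concrete, here is a minimal sketch in code on a tiny random MDP (`n_states`, `n_actions`, the seed, and the tolerance are all made-up illustration values, not anything from the notes). It implements the backup $\mathcal{B}V = \max_a (r_a + \gamma T_a V)$, spot-checks the contraction inequality on two random value vectors, and runs value iteration until the iterates stop changing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# T[a, i, j] = p(s' = j | s = i, a); each row is normalized into a distribution
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states))   # r[a] is the stacked reward vector r_a

def B(V):
    """Bellman backup: (BV)(s) = max_a [ r_a(s) + gamma * (T_a V)(s) ]."""
    return np.max(r + gamma * (T @ V), axis=0)

# Spot-check the contraction property: ||BV1 - BV2||_inf <= gamma * ||V1 - V2||_inf
V1, V2 = rng.standard_normal(n_states), rng.standard_normal(n_states)
assert np.max(np.abs(B(V1) - B(V2))) <= gamma * np.max(np.abs(V1 - V2)) + 1e-12

# Value iteration: repeatedly apply B; the gap to the fixed point V* shrinks by gamma
V = np.zeros(n_states)
for _ in range(1000):
    V_new = B(V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```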
For the tabular case, the update is simply $V \leftarrow \mathcal{B}V$. For the non-tabular case, we need to define a new operator $\Pi$, where $\Pi$ is a projection onto $\Omega$ (the set of value functions our function approximator can represent), in terms of the $\ell_2$-norm:

$$\Pi V = \arg\min_{V' \in \Omega} \tfrac{1}{2} \big\| V' - V \big\|^2$$

So the fitted value iteration algorithm is

$$V \leftarrow \Pi \mathcal{B} V$$
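As a rough sketch of what this looks like with a linear function class (the feature matrix `Phi`, the problem sizes, and every other name below are assumptions made for illustration), $\Omega$ is the span of the feature columns and $\Pi$ is an ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_features, gamma = 20, 3, 4, 0.95

T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states))
Phi = rng.standard_normal((n_states, n_features))   # Omega = span of the columns of Phi

def B(V):
    # Bellman backup, exactly as in the tabular case
    return np.max(r + gamma * (T @ V), axis=0)

def Pi(V):
    # l2 projection onto Omega: arg min_{V' in Omega} ||V' - V||^2, via least squares
    w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
    return Phi @ w

# Fitted value iteration: V <- Pi(B(V)), i.e. regress the features onto the targets B(V)
V = np.zeros(n_states)
for _ in range(200):
    V = Pi(B(V))
```

Each step is just a regression of the features onto the targets $\mathcal{B}V$; since $\Pi \mathcal{B}$ is not a contraction (see the summary below), nothing guarantees this loop settles near $V^*$.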
So value iteration converges in the tabular case, but fitted value iteration does not.
Same as above.
Fitted bootstrapped policy evaluation doesn't converge either.
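A standard way to see this is a classic two-state counterexample in the spirit of Tsitsiklis and Van Roy (the numbers below are an illustrative reconstruction, not taken from these notes): all rewards are zero, so the true value function is identically zero, yet the projected bootstrapped update $V \leftarrow \Pi(r^\pi + \gamma T^\pi V)$ makes the weights blow up.

```python
import numpy as np

gamma = 0.95
T = np.array([[0.0, 1.0],        # s1 -> s2
              [0.0, 1.0]])       # s2 -> s2 (absorbing)
r = np.zeros(2)                  # all rewards are zero, so the true V is 0
Phi = np.array([[1.0], [2.0]])   # linear class: V_w(s1) = w, V_w(s2) = 2w

w = np.array([1.0])
for k in range(10):
    V = Phi @ w                                        # current value estimate
    target = r + gamma * (T @ V)                       # bootstrapped regression targets
    w, *_ = np.linalg.lstsq(Phi, target, rcond=None)   # l2 projection onto span(Phi)
    print(k, float(w[0]))                              # |w| grows by a factor of 6*gamma/5
```

Each exact regression multiplies $w$ by $6\gamma / 5$, so for any $\gamma > 5/6$ the iterates diverge even though every individual fit is solved perfectly.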
To summarize:

- $\mathcal{B}$ is a contraction w.r.t. the $\infty$-norm (the "max" norm)
- $\Pi$ is a contraction w.r.t. the $\ell_2$-norm (Euclidean distance)
- but $\Pi \mathcal{B}$ is not a contraction of any kind