Learning theory
Tabular value function
Does tabular value iteration converge? And if so, what does it converge to? The value iteration update is $V(s) \leftarrow \max_a \left( r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(s' \mid s, a)}[V(s')] \right)$.
And some notation:
$r_a$: the stacked vector of rewards at all states for action $a$
$T_a$: the matrix of transitions for action $a$, such that $T_{a,i,j} = p(s' = i \mid s = j, a)$
Then define an operator $B$: $BV = \max_a \left( r_a + \gamma\, T_a V \right)$.
If $V^\star$ is a fixed point of $B$, i.e. $BV^\star = V^\star$, then $V^\star(s) = \max_a \left( r(s,a) + \gamma\,\mathbb{E}[V^\star(s')] \right)$.
The good news is that in the tabular case, $V^\star$ always exists, is always unique, and always corresponds to the optimal policy. We can prove that value iteration reaches $V^\star$ because $B$ is a contraction: for any $V$ and $\bar{V}$, we have $\|BV - B\bar{V}\|_\infty \le \gamma\,\|V - \bar{V}\|_\infty$, i.e. the gap between $V$ and $\bar{V}$ shrinks by at least a factor of $\gamma$ in the $\infty$-norm with each application of $B$.
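For completeness, a short derivation of the contraction property, with the backup written in expectation form:

$$
\|BV - B\bar V\|_\infty
= \max_s \Big|\max_a\big(r(s,a) + \gamma\,\mathbb{E}_{s'}[V(s')]\big) - \max_a\big(r(s,a) + \gamma\,\mathbb{E}_{s'}[\bar V(s')]\big)\Big|
\le \gamma \max_{s,a}\big|\mathbb{E}_{s'}[V(s') - \bar V(s')]\big|
\le \gamma\,\|V - \bar V\|_\infty ,
$$

using $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$ and the fact that an expectation of quantities each bounded by $\|V - \bar V\|_\infty$ is itself bounded by $\|V - \bar V\|_\infty$.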
Choosing $V^\star$ as $\bar{V}$ (so that $B\bar{V} = V^\star$), we get $\|BV - V^\star\|_\infty \le \gamma\,\|V - V^\star\|_\infty$: each application of $B$ brings $V$ closer to $V^\star$, which completes the proof that value iteration converges to $V^\star$.
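As a sanity check, here is a minimal NumPy sketch of tabular value iteration on a small randomly generated MDP (the sizes, rewards, and transitions are placeholder assumptions). It applies $B$ repeatedly and prints the $\infty$-norm gap to $V^\star$, which shrinks by at least a factor of $\gamma$ per iteration.

```python
import numpy as np

# A minimal sketch: tabular value iteration on a small random MDP.
# num_states, num_actions, gamma, and the random r/T below are placeholder
# assumptions, not quantities from the notes.
num_states, num_actions, gamma = 10, 3, 0.9
rng = np.random.default_rng(0)

# r[a, s]: reward at state s for action a (the stacked vector r_a, one per action)
r = rng.normal(size=(num_actions, num_states))
# T[a, i, j] = p(s' = i | s = j, a): each column of T[a] is a distribution over s'
T = rng.random(size=(num_actions, num_states, num_states))
T /= T.sum(axis=1, keepdims=True)

def bellman_operator(V):
    # (B V)(s) = max_a [ r(s, a) + gamma * E_{s'}[V(s')] ]
    expected_next = np.einsum('aij,i->aj', T, V)   # E_{s'}[V(s')] for each (a, s)
    return np.max(r + gamma * expected_next, axis=0)

# Approximate V* by iterating B to (numerical) convergence.
V_star = np.zeros(num_states)
for _ in range(2000):
    V_star = bellman_operator(V_star)

# The infinity-norm gap to V* shrinks by at least a factor of gamma per step.
V = np.zeros(num_states)
for k in range(10):
    print(f"iter {k}: ||V - V*||_inf = {np.max(np.abs(V - V_star)):.6f}")
    V = bellman_operator(V)
```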
Non-tabular value function
In the tabular case the update is $V \leftarrow BV$. In the non-tabular case we need to define a new operator $\Pi$: $\Pi V = \arg\min_{V' \in \Omega} \frac{1}{2} \sum_s \|V'(s) - V(s)\|^2$, where $\Pi$ is the projection onto $\Omega$ (the set of value functions our function approximator can represent) in the $\ell_2$ sense.
So the fitted value iteration algorithm is $V \leftarrow \Pi B V$.
$B$ is a contraction w.r.t. the $\infty$-norm (the "max" norm)
$\Pi$ is a contraction w.r.t. the $\ell_2$-norm (Euclidean distance)
but $\Pi B$ is not a contraction of any kind.
So value iteration converges in the tabular case, but fitted value iteration does not converge in general.
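Below is a minimal sketch of the $V \leftarrow \Pi B V$ loop with a linear function approximator, where $\Omega$ is the span of an assumed feature matrix `Phi` and $\Pi$ is the exact least-squares projection; all MDP quantities are random placeholders.

```python
import numpy as np

# A minimal sketch of fitted value iteration V <- Pi B V with a linear
# function approximator: Omega = { Phi @ w }. All quantities below
# (sizes, r, T, Phi) are random placeholder assumptions.
num_states, num_actions, num_features, gamma = 20, 2, 3, 0.95
rng = np.random.default_rng(1)

r = rng.normal(size=(num_actions, num_states))              # r[a, s]
T = rng.random(size=(num_actions, num_states, num_states))
T /= T.sum(axis=1, keepdims=True)                           # T[a, i, j] = p(s'=i | s=j, a)
Phi = rng.normal(size=(num_states, num_features))           # row s holds the features of s

def bellman_operator(V):
    # (B V)(s) = max_a [ r(s, a) + gamma * E_{s'}[V(s')] ]
    return np.max(r + gamma * np.einsum('aij,i->aj', T, V), axis=0)

def project(V):
    # Pi V = argmin_{V' in Omega} 1/2 * sum_s (V'(s) - V(s))^2,
    # i.e. least-squares projection of V onto the column space of Phi.
    w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
    return Phi @ w

# Fitted value iteration: alternate the backup B with the projection Pi.
# Unlike the tabular case, this loop is not guaranteed to converge, and any
# fixed point it reaches need not equal V*.
V = np.zeros(num_states)
for _ in range(50):
    V = project(bellman_operator(V))
```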
Fitted Q-iteration
The same argument applies: fitted Q-iteration is $Q \leftarrow \Pi B Q$ with $(BQ)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}[\max_{a'} Q(s',a')]$. $B$ is a contraction in the $\infty$-norm and $\Pi$ is a contraction in the $\ell_2$-norm, but $\Pi B$ is not a contraction, so fitted Q-iteration does not converge in general.
Fitted bootstrapped policy evaluation is likewise not guaranteed to converge.
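A matching sketch of fitted Q-iteration, $Q \leftarrow \Pi B Q$, again with assumed random features and MDP quantities; as with fitted value iteration, nothing guarantees the loop converges.

```python
import numpy as np

# A minimal sketch of fitted Q-iteration Q <- Pi B Q, where
# (B Q)(s, a) = r(s, a) + gamma * E_{s'}[ max_{a'} Q(s', a') ].
# Sizes, r, T, and Phi are random placeholder assumptions.
num_states, num_actions, num_features, gamma = 20, 2, 4, 0.95
rng = np.random.default_rng(2)

r = rng.normal(size=(num_actions, num_states))              # r[a, s]
T = rng.random(size=(num_actions, num_states, num_states))
T /= T.sum(axis=1, keepdims=True)                           # T[a, i, j] = p(s'=i | s=j, a)
Phi = rng.normal(size=(num_actions * num_states, num_features))  # one feature row per flattened (a, s) pair

def bellman_backup(Q):
    # Q[a, s]; the target bootstraps through max_{a'} Q(s', a').
    V_next = Q.max(axis=0)
    return r + gamma * np.einsum('aij,i->aj', T, V_next)

def project(Q):
    # Least-squares projection of the Q-table onto the span of the features.
    w, *_ = np.linalg.lstsq(Phi, Q.reshape(-1), rcond=None)
    return (Phi @ w).reshape(num_actions, num_states)

# Pi B is not a contraction, so this loop is not guaranteed to converge.
Q = np.zeros((num_actions, num_states))
for _ in range(50):
    Q = project(bellman_backup(Q))
```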