Value iteration

Tabular value iteration

Since $A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$ and $V^\pi(s_t)$ does not depend on the action, $\arg\max_{a_t}A^\pi(s_t,a_t)=\arg\max_{a_t}Q^\pi(s_t,a_t)$, so we can simply use $Q^\pi$ instead of $A^\pi$.

$$Q^\pi(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}\left[V^\pi(s')\right]$$

Moreover, the policy itself can be recovered as $\arg\max_{a}Q(s,a)$, so we can skip representing the policy explicitly and compute values directly.
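For instance, given a Q-table with one row per state and one column per action, the greedy policy is a single argmax over the action axis. A tiny sketch with made-up numbers:

```python
import numpy as np

# Hypothetical Q-table for 3 states and 2 actions (values are made up).
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.3, 0.3]])
policy = Q.argmax(axis=1)   # array([1, 0, 0]): best action for each state
```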

value iteration algorithm:

repeat until convergence:

1. set $Q(s,a)\leftarrow r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}[V(s')]$

2. set $V(s)\leftarrow \max_a Q(s,a)$
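A minimal sketch of the tabular algorithm, assuming a small MDP with a known transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (both randomly generated here purely for illustration):

```python
import numpy as np

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize so P[s, a] is a distribution
R = rng.random((S, A))

V = np.zeros(S)
for _ in range(1000):
    Q = R + gamma * P @ V           # step 1: Q(s,a) <- r(s,a) + gamma E[V(s')]
    V_new = Q.max(axis=1)           # step 2: V(s) <- max_a Q(s,a)
    if np.abs(V_new - V).max() < 1e-8:   # stop once the backup has converged
        V = V_new
        break
    V = V_new
policy = Q.argmax(axis=1)           # greedy policy, recovered from Q at the end
```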

Fitted value iteration

The question is how to represent $V(s)$. In small problems we can use a big table with one entry for each discrete state $s$, but this is not feasible for real-world problems, especially with image inputs, due to the curse of dimensionality. In such cases a neural network $V_\phi:\mathcal{S}\to\mathbb{R}$ can be used as a function approximator, trained with the regression loss

$$\mathcal{L}(\phi)=\frac{1}{2}\left\|V_\phi(s)-\max_{a}Q^\pi(s,a)\right\|^2$$

fitted value iteration algorithm:

repeat until convergence:

1. set $y_i\leftarrow \max_{a_i}\left(r(s_i,a_i)+\gamma\,\mathbb{E}\left[V_\phi(s'_i)\right]\right)$

2. set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\left\|V_\phi(s_i)-y_i\right\|^2$
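A sketch of fitted value iteration under strong simplifying assumptions: a hypothetical 1-D toy model with known deterministic dynamics (so the expectation over $s'$ is trivial), two discrete actions, and a small PyTorch network for $V_\phi$. The names `dynamics`, `reward`, and all hyperparameters are illustrative, not part of the lecture:

```python
import torch
import torch.nn as nn

ACTIONS = [-1.0, 1.0]   # two discrete actions (assumed toy action set)
GAMMA = 0.99

def dynamics(s, a):
    # Deterministic next state s' = clip(s + 0.1 * a), so E[V(s')] = V(s').
    return (s + 0.1 * a).clamp(-1.0, 1.0)

def reward(s, a):
    # Toy reward that depends only on the state: stay near the origin.
    return -(s ** 2).squeeze(-1)

# Neural value function V_phi : S -> R
V = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for it in range(500):
    s = torch.rand(256, 1) * 2 - 1          # sample a minibatch of states
    # step 1: y_i <- max_a ( r(s_i, a) + gamma * V_phi(s'_i) )
    with torch.no_grad():
        backups = torch.stack(
            [reward(s, a) + GAMMA * V(dynamics(s, a)).squeeze(-1)
             for a in ACTIONS], dim=1)
        y = backups.max(dim=1).values
    # step 2: phi <- argmin_phi 1/2 sum_i ||V_phi(s_i) - y_i||^2
    loss = 0.5 * ((V(s).squeeze(-1) - y) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that forming the targets $y_i$ requires evaluating $r(s_i,a)$ and the next state for every action, i.e., it assumes access to a known model.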
