Value iteration


Tabular value iteration

Considering that $A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$ and $V^\pi(s_t)$ does not depend on the action, we have $\arg\max_{a_t}A^\pi(s_t,a_t)=\arg\max_{a_t}Q^\pi(s_t,a_t)$, so we can just use $Q^\pi$ instead of $A^\pi$, where

$$Q^\pi(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}\left[V^\pi(s')\right]$$

Moreover, the greedy policy can be recovered directly as $\arg\max_a Q(s,a)$, so we can skip representing the policy explicitly and compute values directly:

Value iteration algorithm:

repeat until convergence:

1. set $Q(s,a)\leftarrow r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}\left[V(s')\right]$
2. set $V(s)\leftarrow \max_a Q(s,a)$
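
As a concrete illustration, here is a minimal NumPy sketch of tabular value iteration for a small MDP with known dynamics. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for rewards) and the convergence tolerance are assumptions made for this example, not something fixed by the lecture.

```python
import numpy as np

# A minimal sketch of tabular value iteration for a known MDP.
# Assumed array layout (illustrative):
#   P[s, a, s_next] : transition probabilities p(s' | s, a)
#   R[s, a]         : rewards r(s, a)
def value_iteration(P, R, gamma=0.99, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:  # repeat until convergence
        # step 1: Q(s, a) <- r(s, a) + gamma * E_{s' ~ p(s'|s,a)}[V(s')]
        Q = R + gamma * (P @ V)          # shape (n_states, n_actions)
        # step 2: V(s) <- max_a Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # the greedy policy is just argmax_a Q(s, a)
            return V_new, Q.argmax(axis=1)
        V = V_new
```

The greedy policy is recovered at the end with `Q.argmax(axis=1)`, matching the remark above that the policy is simply $\arg\max_a Q(s,a)$.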

Fitted value iteration

The question is how to represent $V(s)$. For small problems we can use a big table with one entry per discrete state $s$, but this does not scale to real-world problems, especially with image inputs, due to the curse of dimensionality. In that case a neural network $V_\phi: \mathcal{S}\to\mathbb{R}$ can represent the value function, trained with the loss

$$\mathcal{L}(\phi)=\frac{1}{2}\left\|V_\phi(s)-\max_{a}Q^\pi(s,a)\right\|^2$$

Fitted value iteration algorithm:

repeat until convergence:

1. set $y_i\leftarrow \max_{a_i}\left(r(s_i,a_i)+\gamma\,\mathbb{E}\left[V_\phi(s'_i)\right]\right)$
2. set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\left\|V_\phi(s_i)-y_i\right\|^2$
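
For illustration, below is a minimal PyTorch sketch of one fitted value iteration step. It assumes a deterministic, known model so that the expectation $\mathbb{E}[V_\phi(s'_i)]$ reduces to evaluating $V_\phi$ at a single successor state per action; the network architecture and all variable names are assumptions for this sketch only.

```python
import torch
import torch.nn as nn

# Illustrative value network for a 4-dimensional state space.
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def fitted_vi_step(states, rewards, next_states):
    """One fitted value iteration update (sketch, deterministic known model).

    states:      (N, state_dim)            sampled states s_i
    rewards:     (N, n_actions)            rewards r(s_i, a) for every action
    next_states: (N, n_actions, state_dim) successor states s'_i for every action
    """
    with torch.no_grad():
        # step 1: y_i <- max_a ( r(s_i, a) + gamma * V_phi(s'_i) )
        next_v = value_net(next_states).squeeze(-1)        # (N, n_actions)
        targets = (rewards + gamma * next_v).max(dim=1).values
    # step 2: phi <- argmin_phi 1/2 sum_i || V_phi(s_i) - y_i ||^2
    loss = 0.5 * ((value_net(states).squeeze(-1) - targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that computing the max over actions here requires querying the model for every action's outcome, which is the limitation addressed by fitting a Q-function instead (see Q iteration).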
