Considering that $\arg\max_{a_t} A^\pi(s_t, a_t) = \arg\max_{a_t} Q^\pi(s_t, a_t)$ (the two differ only by $V^\pi(s_t)$, which does not depend on the action), we can just use $Q^\pi$ instead of $A^\pi$:

$$Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[V^\pi(s')\right]$$
Moreover, the greedy policy can be recovered directly as $\arg\max_a Q(s, a)$, so we can skip representing the policy explicitly and compute values directly:
Value iteration algorithm:

repeat until convergence:
1. set $Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[V(s')\right]$
2. set $V(s) \leftarrow \max_a Q(s, a)$
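As a concrete reference, here is a minimal tabular sketch of the two steps above, assuming a known MDP given by a reward array `reward[s, a]` and a transition tensor `P[s, a, s']` (both hypothetical names, not from the notes):

```python
import numpy as np

# Minimal tabular value iteration for a known MDP.
# reward[s, a] and P[s, a, s'] are hypothetical arrays.
def value_iteration(reward, P, gamma=0.99, tol=1e-6):
    n_states, n_actions = reward.shape
    V = np.zeros(n_states)
    while True:
        # Step 1: Q(s, a) <- r(s, a) + gamma * E_{s' ~ p}[V(s')]
        Q = reward + gamma * (P @ V)          # P @ V contracts over s'
        # Step 2: V(s) <- max_a Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # stop once V stops changing
            return V_new, Q
        V = V_new
```

The greedy policy can then be read off as `Q.argmax(axis=1)`, matching the $\arg\max_a Q(s, a)$ rule above.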
Fitted value iteration
The question is how to represent $V(s)$. For small problems we can use a big table, with one entry per discrete state $s$, but this is impractical in the real world, especially with image inputs, due to the curse of dimensionality. In that case, a neural network function approximator can be used: $V: \mathcal{S} \to \mathbb{R}$.
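A minimal sketch of what this could look like, assuming a small PyTorch MLP for $V$, a toy known dynamics model, and hypothetical helpers `sample_states` and `env_model` (none of which are from the original notes):

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99

def sample_states(batch_size):
    # Hypothetical: sample states from some distribution over S.
    return torch.randn(batch_size, state_dim)

def env_model(states, a):
    # Hypothetical known dynamics: reward and next state for action a.
    rewards = -states.pow(2).sum(dim=1)
    next_states = states + 0.1 * (a - 0.5)
    return rewards, next_states

# Neural network V_phi: S -> R replaces the big table.
V = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for it in range(1000):
    states = sample_states(batch_size=128)
    with torch.no_grad():
        # Bootstrapped targets: y(s) = max_a [ r(s, a) + gamma * V(s') ]
        q_values = []
        for a in range(n_actions):
            rewards, next_states = env_model(states, a)
            q_values.append(rewards + gamma * V(next_states).squeeze(-1))
        targets = torch.stack(q_values, dim=1).max(dim=1).values
    # Regress V_phi(s) onto the targets (replaces the table update V <- max_a Q)
    loss = nn.functional.mse_loss(V(states).squeeze(-1), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The regression step replaces the exact table assignment $V(s) \leftarrow \max_a Q(s, a)$ with a supervised fit of $V_\phi$ to the bootstrapped targets.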