Policy iteration
Last chapter we discussed actor-critic algorithms, but what if we use only the critic (value function), without an actor (policy)? We certainly can, by extracting a policy from the value function.
Firstly, $A^\pi(s_t, a_t)$ measures how much better $a_t$ is than the average action according to $\pi$, and $\arg\max_{a_t} A^\pi(s_t, a_t)$ gives the best action from $s_t$ if we then follow $\pi$, which can be viewed as a substitute for the policy $\pi(a_t \mid s_t)$. Notice that the "policy" obtained by this max trick is at least as good as any ordinary policy, regardless of what $\pi(a_t \mid s_t)$ is.
So forget about explicit policies; let's just use the max trick.
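As a tiny sketch of the max trick (assuming a hypothetical tabular advantage estimate `A[s, a]` for a small discrete problem; the array names are illustrative), the implicit policy is just an argmax over actions:

```python
import numpy as np

# Hypothetical tabular advantage estimate: A[s, a] ~ A^pi(s, a).
A = np.random.randn(16, 4)

# The max trick: in each state, the implicit policy picks the action with the largest advantage.
greedy_actions = np.argmax(A, axis=1)   # shape (16,), one action index per state
```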

At a high level, this gives the policy iteration algorithm:
Policy Iteration:
repeat until convergence:
==== 1: evaluate $A^\pi(s_t, a_t)$
==== 2: set $\pi \leftarrow \pi'$, where $\pi'(a_t \mid s_t) = 1$ if $a_t = \arg\max_{a_t} A^\pi(s_t, a_t)$ and $0$ otherwise
As before, $A^\pi(s, a) = r(s, a) + \gamma \, \mathbb{E}[V^\pi(s')] - V^\pi(s)$. So now the key problem is how to evaluate $V^\pi(s)$.
Dynamic programming
First of all, some basic assumptions: we know the dynamics $p(s' \mid s, a)$, and $s$ and $a$ are both discrete (and small). For example:

16 states, 4 actions per state. We can store the full $V^\pi(s)$ in a table, and the transition operator $\mathcal{T}$ is a $16 \times 16 \times 4$ tensor.
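To make the shapes concrete, here is a minimal sketch of this tabular setting (random `r` and `T` stand in for a real MDP; the array names and the `(s, a, s')` indexing convention are my own choices, not fixed by the notes). Given a value table `V`, the advantage formula above becomes a simple tensor contraction:

```python
import numpy as np

num_states, num_actions = 16, 4
gamma = 0.99

# Hypothetical tabular MDP: r[s, a] is the expected reward,
# T[s, a, s'] is the transition probability p(s' | s, a).
rng = np.random.default_rng(0)
r = rng.standard_normal((num_states, num_actions))
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

V = np.zeros(num_states)  # tabular V^pi(s), one entry per state

# A^pi(s, a) = r(s, a) + gamma * E_{s'}[V^pi(s')] - V^pi(s)
A = r + gamma * (T @ V) - V[:, None]   # shape (num_states, num_actions)
```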
We can use the bootstrapped update:
$V^\pi(s) \leftarrow \mathbb{E}_{a \sim \pi(a \mid s)}\big[r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V^\pi(s')]\big]$
Since we use a deterministic policy $\pi(s) = a$, the update can be simplified even further:
$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]$
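A sketch of this evaluation step under the same assumed tabular setup (the fixed number of sweeps is an arbitrary stopping rule for illustration; in practice one iterates until the values stop changing):

```python
import numpy as np

num_states, num_actions = 16, 4
gamma = 0.99

# Hypothetical tabular MDP, as in the previous sketch.
rng = np.random.default_rng(0)
r = rng.standard_normal((num_states, num_actions))
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

pi = np.zeros(num_states, dtype=int)  # deterministic policy: pi[s] is the action taken in state s
s = np.arange(num_states)

# Bootstrapped policy evaluation: V(s) <- r(s, pi(s)) + gamma * E_{s'}[V(s')]
V = np.zeros(num_states)
for _ in range(100):
    V = r[s, pi] + gamma * T[s, pi] @ V
```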

Simplified Policy Iteration:
repeat until convergence:
==== 1: evaluate $V^\pi(s)$ with the simplified bootstrapped update
==== 2: set $\pi \leftarrow \pi'$, where $\pi'(s_t) = \arg\max_{a_t} A^\pi(s_t, a_t)$
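Putting both steps together, a minimal end-to-end sketch under the same assumptions (random `r` and `T` as stand-ins for a real MDP; the convergence test simply checks that the greedy policy has stopped changing):

```python
import numpy as np

num_states, num_actions = 16, 4
gamma = 0.99

# Hypothetical tabular MDP standing in for a real environment.
rng = np.random.default_rng(0)
r = rng.standard_normal((num_states, num_actions))
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

pi = np.zeros(num_states, dtype=int)   # deterministic policy pi(s) = a
s = np.arange(num_states)

while True:
    # Step 1: evaluate V^pi with the simplified bootstrapped update.
    V = np.zeros(num_states)
    for _ in range(100):
        V = r[s, pi] + gamma * T[s, pi] @ V

    # Step 2: improve the policy with the max trick, pi'(s) = argmax_a A^pi(s, a).
    A = r + gamma * (T @ V) - V[:, None]
    pi_new = np.argmax(A, axis=1)

    if np.array_equal(pi_new, pi):     # converged: the greedy policy stopped changing
        break
    pi = pi_new
```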