Policy iteration
Last chapter we discussed actor-critic algorithms, but what if we just use the critic (the value function) without an actor (the policy)? Surely we can do this, by extracting a policy from a value function.
Firstly, $A^\pi(s_t, a_t)$ measures how much better $a_t$ is than the average action under $\pi$, and $\arg\max_{a_t} A^\pi(s_t, a_t)$ is the best action to take from $s_t$ if we then follow $\pi$; picking it can be viewed as a substitute policy $\pi'$. Notice that the "policy" obtained by the max trick is at least as good as the original policy $\pi$, regardless of what $\pi$ is.
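Written out explicitly, the greedy policy implied by the max trick is deterministic:

$$\pi'(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a_t} A^\pi(s_t, a_t) \\ 0 & \text{otherwise} \end{cases}$$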
So forget explicit policies; let's just use the max trick.
At a high level, this gives us the policy iteration algorithm:
Policy Iteration:
repeat until convergence:
==== 1: evaluate $A^\pi(s, a)$
==== 2: set $\pi \leftarrow \pi'$
We can use a bootstrapped update to evaluate the value function:

$$V^\pi(s) \leftarrow \mathbb{E}_{a \sim \pi(a \mid s)}\big[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V^\pi(s')]\big]$$

which gives a simplified version of the algorithm:

Simplified Policy Iteration:
repeat until convergence:
==== 1: evaluate $V^\pi(s)$ (with the bootstrapped update above)
==== 2: set $\pi \leftarrow \pi'$
As before, $A^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}[V^\pi(s')] - V^\pi(s)$. So now the key problem is to evaluate $V^\pi(s)$.
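Since $V^\pi(s)$ does not depend on $a$, subtracting it does not change which action is best:

$$\arg\max_a A^\pi(s, a) = \arg\max_a \big( r(s, a) + \gamma\, \mathbb{E}[V^\pi(s')] \big)$$

so once we can evaluate $V^\pi$, the greedy policy follows directly (given the reward and dynamics).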
First of all, some basic assumptions: assume we know the dynamics $p(s' \mid s, a)$, and that $s$ and $a$ are both discrete (and small). For example:
16 states, 4 actions per state. We can store the full $V^\pi(s)$ in a table, and the transition probabilities $p(s' \mid s, a)$ form a $16 \times 16 \times 4$ tensor $T$.
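A minimal NumPy sketch of this tabular setting and the bootstrapped evaluation update (the array names `T`, `r`, `V`, `pi` and the random placeholder dynamics are illustrative, not from the text):

```python
import numpy as np

n_states, n_actions, gamma = 16, 4, 0.99

# Illustrative placeholders for the tabular quantities:
# T[s_next, s, a] = p(s' | s, a), r[s, a] = reward.
T = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions)).transpose(2, 0, 1)
r = np.random.randn(n_states, n_actions)


def evaluate_policy(pi, V, n_sweeps=500):
    """Bootstrapped evaluation of a stochastic tabular policy pi[s, a] = pi(a|s)."""
    for _ in range(n_sweeps):
        # E_{s' ~ p(s'|s,a)}[V(s')] for every (s, a): contract T over s'.
        ev_next = np.einsum('psa,p->sa', T, V)
        # V(s) <- E_{a ~ pi(a|s)}[ r(s, a) + gamma * E_{s'}[V(s')] ]
        V = np.sum(pi * (r + gamma * ev_next), axis=1)
    return V


# Example: evaluate the uniform random policy starting from V = 0.
pi_uniform = np.full((n_states, n_actions), 1.0 / n_actions)
V = evaluate_policy(pi_uniform, np.zeros(n_states))
```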
Since our policy is deterministic, $\pi(s) = a$, the update equation can be simplified even further:
==== 1: evaluate $V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}\big[V^\pi(s')\big]$
==== 2: set $\pi \leftarrow \pi'$
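Putting the pieces together, here is a rough sketch of the full simplified policy iteration loop under the same assumptions, with the deterministic policy stored as an array of action indices (again, the names and convergence details are illustrative, not a reference implementation):

```python
import numpy as np


def policy_iteration(T, r, gamma=0.99, n_eval_sweeps=500, tol=1e-6):
    """Tabular policy iteration with a deterministic policy pi[s] (an action index).

    T[s_next, s, a] = p(s' | s, a), r[s, a] = reward; both assumed known (see above).
    """
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)      # arbitrary initial deterministic policy
    V = np.zeros(n_states)
    states = np.arange(n_states)
    while True:
        # 1: evaluate V^pi with the simplified (deterministic-policy) bootstrapped update
        for _ in range(n_eval_sweeps):
            ev_next = np.einsum('psa,p->sa', T, V)       # E_{s'}[V(s')] for each (s, a)
            V_new = r[states, pi] + gamma * ev_next[states, pi]
            done = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if done:
                break
        # 2: set pi <- pi': greedy w.r.t. A^pi, i.e. argmax_a Q^pi(s, a)
        Q = r + gamma * np.einsum('psa,p->sa', T, V)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):                   # greedy policy stopped changing
            return pi, V
        pi = pi_new
```

Running the evaluation to a fixed tolerance is just one choice; fewer sweeps per outer iteration also work, at the cost of a less accurate $V^\pi$ between policy updates.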