Policy iteration


Last chapter we discussed actor-critic algorithms, but what if we just use a critic (value function) without an actor (policy)? We can do this by extracting a policy from the value function.

First, $A^\pi(s_t,a_t)$ measures how much better $a_t$ is than the average action according to $\pi$, and $\arg\max_{a_t} A^\pi(s_t,a_t)$ picks the best action from $s_t$ if we then follow $\pi$, which can be viewed as a substitute for the policy $\pi(a_t|s_t)$. Notice that the "policy" obtained by this max trick is at least as good as any other policy, regardless of what $\pi(a_t|s_t)$ is.

So let's forget explicit policies and just use the max trick:

$$\pi'(a_t|s_t)=\begin{cases}1 & \text{if } a_t=\arg\max_{a_t} A^\pi(s_t,a_t)\\ 0 & \text{otherwise}\end{cases}$$
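
In a tabular setting this improvement step is just an argmax over a table. Below is a minimal NumPy sketch, assuming the advantage estimates $A^\pi(s,a)$ are already available as an array of shape (num_states, num_actions); the function name and shapes are illustrative, not part of any fixed API.

```python
import numpy as np

def greedy_policy_from_advantage(A):
    """Build pi'(a|s): probability 1 on argmax_a A^pi(s, a), 0 elsewhere."""
    num_states, num_actions = A.shape
    pi_new = np.zeros((num_states, num_actions))
    best_actions = np.argmax(A, axis=1)               # argmax_a A^pi(s, a) per state
    pi_new[np.arange(num_states), best_actions] = 1.0
    return pi_new
```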

At a high level, this gives the policy iteration algorithm:

Policy Iteration:

repeat until convergence:

1: evaluate $A^\pi(s_t,a_t)$

2: set $\pi\leftarrow \pi'$

As before, $A^\pi(s,a)=r(s,a)+\gamma\,\mathbb{E}[V^\pi(s')]-V^\pi(s)$. So now the key problem is to evaluate $V^\pi(s)$.
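
Given $V^\pi$, the improvement step above is a cheap table computation. Here is a sketch of the advantage in the tabular case, assuming (purely for illustration) rewards $r$ of shape (S, A), dynamics $p$ of shape (S', S, A) with `p[s_next, s, a] = p(s'|s, a)`, and a value table $V$ of shape (S,):

```python
import numpy as np

def advantage_table(r, p, V, gamma=0.99):
    """A^pi(s, a) = r(s, a) + gamma * E_{s'~p(.|s,a)}[V^pi(s')] - V^pi(s)."""
    expected_next_V = np.einsum('nsa,n->sa', p, V)   # E over s' for every (s, a)
    return r + gamma * expected_next_V - V[:, None]
```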

Dynamic programming

First of all, some basic assumptions: we know the dynamics $p(s'|s,a)$, and both $s$ and $a$ are discrete (and small). For example:

16 states, 4 actions per state. We can store the full $V^\pi(s)$ in a table, and the transition tensor $\mathcal{T}$ is a $16\times16\times4$ tensor.

We can use the bootstrapped update:

$$V^\pi(s)\leftarrow \mathbb{E}_{a\sim\pi(a|s)}\left[r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}[V^\pi(s')]\right]$$

Since we use a deterministic policy $\pi(s)=a$, the update can be simplified even further:

$$V^\pi(s)\leftarrow r(s,\pi(s))+\gamma\,\mathbb{E}_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]$$
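
As a concrete sketch, iterative policy evaluation for a deterministic tabular policy could look like the following, with the same illustrative shape conventions as above ($r$: (S, A), $p$: (S', S, A)) and a fixed number of sweeps instead of a convergence check:

```python
import numpy as np

def evaluate_policy(r, p, pi, gamma=0.99, num_sweeps=200):
    """Iterate V(s) <- r(s, pi(s)) + gamma * E_{s'~p(.|s, pi(s))}[V(s')]."""
    num_states = r.shape[0]
    states = np.arange(num_states)
    V = np.zeros(num_states)                             # start from V = 0
    for _ in range(num_sweeps):
        expected_next_V = np.einsum('nsa,n->sa', p, V)   # shape (S, A): E_{s'}[V(s')]
        V = r[states, pi] + gamma * expected_next_V[states, pi]
    return V
```

Here `pi` is an integer array of shape (S,) holding the action $\pi(s)$ for each state.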

Simplified Policy Iteration:

repeat until convergence:

1: evaluate $V^\pi(s)$

2: set $\pi\leftarrow \pi'$
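
Putting the two steps together, a minimal end-to-end sketch of this loop in the tabular setting (same illustrative shape assumptions as before) might look like:

```python
import numpy as np

def policy_iteration(r, p, gamma=0.99, eval_sweeps=200, max_iters=100):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    num_states, num_actions = r.shape
    states = np.arange(num_states)
    pi = np.zeros(num_states, dtype=int)             # arbitrary initial deterministic policy
    V = np.zeros(num_states)
    for _ in range(max_iters):
        # Step 1: evaluate V^pi with repeated bootstrapped updates
        for _ in range(eval_sweeps):
            next_V = np.einsum('nsa,n->sa', p, V)    # E_{s'}[V(s')] for all (s, a)
            V = r[states, pi] + gamma * next_V[states, pi]
        # Step 2: greedy improvement, pi <- pi'
        A = r + gamma * np.einsum('nsa,n->sa', p, V) - V[:, None]
        new_pi = np.argmax(A, axis=1)
        if np.array_equal(new_pi, pi):               # policy unchanged -> converged
            break
        pi = new_pi
    return pi, V
```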
