Policy iteration
Last chapter we discussed actor-critic algorithms, but what if we use only the critic (value function), without an actor (policy)? We certainly can, by extracting a policy from the value function.
Firstly, $A^\pi(s_t, a_t)$ measures how much better $a_t$ is than the average action according to $\pi$, and $\arg\max_{a_t} A^\pi(s_t, a_t)$ gives the best action from $s_t$ if we then follow $\pi$, which can be viewed as a substitute for the policy $\pi(a_t \mid s_t)$. Notice that the "policy" obtained by this max trick is at least as good as any ordinary policy, regardless of what $\pi(a_t \mid s_t)$ is.
So forget about explicit policies; let's just use the max trick.
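As a tiny sketch of the max trick (assuming a hypothetical tabular advantage estimate `A[s, a]` for a small discrete problem; the array names are illustrative), the implicit policy is just an argmax over actions:

```python
import numpy as np

# Hypothetical tabular advantage estimate: A[s, a] ~ A^pi(s, a).
A = np.random.randn(16, 4)

# The max trick: in each state, the implicit policy picks the action with the largest advantage.
greedy_actions = np.argmax(A, axis=1)   # shape (16,), one action index per state
```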

At a high level, this gives the policy iteration algorithm:
Policy Iteration:
repeat until convergence:
==== 1: evaluate $A^\pi(s_t, a_t)$
==== 2: set $\pi \leftarrow \pi'$, where $\pi'(a_t \mid s_t) = 1$ if $a_t = \arg\max_{a_t} A^\pi(s_t, a_t)$ and $0$ otherwise
As before, $A^\pi(s, a) = r(s, a) + \gamma \, \mathbb{E}[V^\pi(s')] - V^\pi(s)$. So now the key problem is how to evaluate $V^\pi(s)$.
Dynamic programming
First of all, some basic assumptions: we know the dynamics $p(s' \mid s, a)$, and $s$ and $a$ are both discrete (and small). For example:

16 states, 4 actions per state. We can store the full $V^\pi(s)$ in a table, and the transition operator $\mathcal{T}$ is a $16 \times 16 \times 4$ tensor.
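To make the shapes concrete, here is a minimal sketch of this tabular setting (random `r` and `T` stand in for a real MDP; the array names and the `(s, a, s')` indexing convention are my own choices, not fixed by the notes). Given a value table `V`, the advantage formula above becomes a simple tensor contraction:

```python
import numpy as np

num_states, num_actions = 16, 4
gamma = 0.99

# Hypothetical tabular MDP: r[s, a] is the expected reward,
# T[s, a, s'] is the transition probability p(s' | s, a).
rng = np.random.default_rng(0)
r = rng.standard_normal((num_states, num_actions))
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

V = np.zeros(num_states)  # tabular V^pi(s), one entry per state

# A^pi(s, a) = r(s, a) + gamma * E_{s'}[V^pi(s')] - V^pi(s)
A = r + gamma * (T @ V) - V[:, None]   # shape (num_states, num_actions)
```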
We can use the bootstrapped update:
$V^\pi(s) \leftarrow \mathbb{E}_{a \sim \pi(a \mid s)}\big[r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V^\pi(s')]\big]$
Since we use a deterministic policy $\pi(s) = a$, the update can be simplified even further:
$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]$
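A sketch of this evaluation step under the same assumed tabular setup (the fixed number of sweeps is an arbitrary stopping rule for illustration; in practice one iterates until the values stop changing):

```python
import numpy as np

num_states, num_actions = 16, 4
gamma = 0.99

# Hypothetical tabular MDP, as in the previous sketch.
rng = np.random.default_rng(0)
r = rng.standard_normal((num_states, num_actions))
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

pi = np.zeros(num_states, dtype=int)  # deterministic policy: pi[s] is the action taken in state s
s = np.arange(num_states)

# Bootstrapped policy evaluation: V(s) <- r(s, pi(s)) + gamma * E_{s'}[V(s')]
V = np.zeros(num_states)
for _ in range(100):
    V = r[s, pi] + gamma * T[s, pi] @ V
```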

Simplified Policy Iteration:
repeat until convergence:
==== 1: evaluate $V^\pi(s)$ with the simplified bootstrapped update
==== 2: set $\pi \leftarrow \pi'$, where $\pi'(s_t) = \arg\max_{a_t} A^\pi(s_t, a_t)$
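Putting both steps together, a minimal end-to-end sketch under the same assumptions (random `r` and `T` as stand-ins for a real MDP; the convergence test simply checks that the greedy policy has stopped changing):

```python
import numpy as np

num_states, num_actions = 16, 4
gamma = 0.99

# Hypothetical tabular MDP standing in for a real environment.
rng = np.random.default_rng(0)
r = rng.standard_normal((num_states, num_actions))
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

pi = np.zeros(num_states, dtype=int)   # deterministic policy pi(s) = a
s = np.arange(num_states)

while True:
    # Step 1: evaluate V^pi with the simplified bootstrapped update.
    V = np.zeros(num_states)
    for _ in range(100):
        V = r[s, pi] + gamma * T[s, pi] @ V

    # Step 2: improve the policy with the max trick, pi'(s) = argmax_a A^pi(s, a).
    A = r + gamma * (T @ V) - V[:, None]
    pi_new = np.argmax(A, axis=1)

    if np.array_equal(pi_new, pi):     # converged: the greedy policy stopped changing
        break
    pi = pi_new
```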