Fit what?
So now we need to fit one of $Q^\pi$, $V^\pi$, or $A^\pi$. The question is: fit what to what? Recall:
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big]$$
$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[Q^\pi(s_t, a_t)\big]$$
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
Firstly, both $Q^\pi$ and $A^\pi$ can be calculated or approximated from $V^\pi$:
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big] = r(s_t, a_t) + \sum_{t'=t+1}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big] \approx r(s_t, a_t) + V^\pi(s_{t+1})$$
$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$
Apart from this, both $Q^\pi$ and $A^\pi$ need two inputs $(s_t, a_t)$, but $V^\pi$ only needs $s_t$, which may be easier to fit.
So let's just fit $V^\pi$.
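To make the "only needs $s_t$" point concrete, here is a minimal PyTorch sketch (the network sizes and the small MLP architecture are my own illustrative assumptions, not part of the notes): a $\hat{V}^\pi_\phi$ network consumes only the state, while a $\hat{Q}^\pi$ network would need the state and action together.

```python
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 8, 2, 64    # hypothetical sizes, not from the notes

# V^pi only needs the state s_t as input ...
value_net = nn.Sequential(
    nn.Linear(state_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

# ... whereas Q^pi would need both s_t and a_t concatenated.
q_net = nn.Sequential(
    nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

s = torch.randn(32, state_dim)               # batch of states
a = torch.randn(32, action_dim)              # batch of (continuous) actions
v = value_net(s)                             # shape (32, 1)
q = q_net(torch.cat([s, a], dim=-1))         # shape (32, 1)
```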
Fitted to what?
$$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t\big]$$
Fitting $V^\pi(s_t)$ evaluates how good the policy is, which is called policy evaluation. As in policy gradient, we can use Monte Carlo policy evaluation:
$$V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$
If we are able to reset the simulator, we can use more than one sample:
$$V^\pi(s_t) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$
But the single-sample estimate is still pretty good, so the supervised learning setup is:
training data: $\{(s_{i,t}, \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}))\}$. Define the target $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.
supervised regression: $\mathcal{L}(\phi) = \frac{1}{2} \sum_i \big\| \hat{V}^\pi_\phi(s_i) - y_i \big\|^2$
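A minimal sketch of this Monte Carlo policy evaluation in PyTorch, reusing the hypothetical `value_net` above (the helper names, optimizer choice, and epoch count are my own illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def monte_carlo_targets(rewards):
    """Reward-to-go targets y_t = sum_{t'=t}^{T} r_{t'} for one sampled trajectory."""
    targets, running = [], 0.0
    for r in reversed(rewards):              # accumulate sums from the end of the episode
        running = r + running
        targets.append(running)
    return list(reversed(targets))

def fit_value_function(value_net, optimizer, states, targets, epochs=50):
    """Supervised regression: minimize 1/2 * sum_i ||V_phi(s_i) - y_i||^2."""
    y = torch.tensor(targets, dtype=torch.float32).unsqueeze(-1)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = 0.5 * F.mse_loss(value_net(states), y, reduction="sum")
        loss.backward()
        optimizer.step()
```

Here `states` is a float tensor of shape `(T, state_dim)` holding the visited states of one trajectory, e.g. `fit_value_function(value_net, torch.optim.Adam(value_net.parameters(), lr=1e-3), states, monte_carlo_targets(rewards))`.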
Bootstrapped estimate
The Monte Carlo target $y_{i,t}$ is not perfect. Can we do better? The ideal target is:
$$y_{i,t} = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{i,t'}, a_{i,t'}) \mid s_{i,t}\big] \approx r(s_{i,t}, a_{i,t}) + \sum_{t'=t+1}^{T} E_{\pi_\theta}\big[r(s_{i,t'}, a_{i,t'}) \mid s_{i,t+1}\big] \approx r(s_{i,t}, a_{i,t}) + V^\pi(s_{i,t+1}) \approx r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$$
Sample one step and then directly use the previously fitted value function.
This estimate has more bias but lower variance.
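A sketch of the bootstrapped target in code, assuming `value_net` and batched tensors of rewards and next states (the terminal-state mask is my own addition to handle episode boundaries, not something the notes discuss):

```python
import torch

def bootstrapped_targets(rewards, next_states, dones, value_net):
    """y_t ~ r(s_t, a_t) + V_phi(s_{t+1}), using the previously fitted value network."""
    with torch.no_grad():                        # targets are treated as constants
        next_values = value_net(next_states).squeeze(-1)
    # zero out V(s_{t+1}) at terminal steps (assumed convention, not from the notes)
    return rewards + (1.0 - dones) * next_values
```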
Algorithm
Batch actor-critic algorithm:
repeat until convergence:
====1: sample $\{s_i, a_i\}$ from $\pi_\theta(a \mid s)$ (run it on the robot)
====2: fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (or use the bootstrapped estimate as the target)
====3: evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \hat{V}^\pi_\phi(s_i') - \hat{V}^\pi_\phi(s_i)$
====4: $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \, \hat{A}^\pi(s_i, a_i)$
====5: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
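Putting the five steps together, here is a minimal sketch of the batch actor-critic loop, assuming a discrete-action Gymnasium-style `env`, an MLP `policy_net` producing action logits, and the `value_net` from above; all hyperparameters and the terminal masking are illustrative assumptions rather than part of the notes.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def batch_actor_critic(env, policy_net, value_net, iterations=200,
                       batch_episodes=10, lr=1e-3):
    pi_opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    v_opt = torch.optim.Adam(value_net.parameters(), lr=lr)

    for _ in range(iterations):
        states, actions, rewards, next_states, dones = [], [], [], [], []

        # 1: sample {s_i, a_i} from pi_theta(a|s) by running the policy
        for _ in range(batch_episodes):
            s, _ = env.reset()
            done = False
            while not done:
                logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
                a = Categorical(logits=logits).sample()
                s_next, r, terminated, truncated, _ = env.step(a.item())
                done = terminated or truncated
                states.append(s); actions.append(a.item()); rewards.append(r)
                next_states.append(s_next); dones.append(float(terminated))
                s = s_next

        s_t = torch.as_tensor(np.array(states), dtype=torch.float32)
        a_t = torch.as_tensor(actions)
        r_t = torch.as_tensor(rewards, dtype=torch.float32)
        s_tp1 = torch.as_tensor(np.array(next_states), dtype=torch.float32)
        d_t = torch.as_tensor(dones, dtype=torch.float32)

        # 2: fit V_phi to the bootstrapped targets y = r + V_phi(s')
        with torch.no_grad():
            y = r_t + (1.0 - d_t) * value_net(s_tp1).squeeze(-1)
        v_loss = 0.5 * F.mse_loss(value_net(s_t).squeeze(-1), y, reduction="sum")
        v_opt.zero_grad(); v_loss.backward(); v_opt.step()

        # 3: evaluate A(s_i, a_i) = r + V_phi(s') - V_phi(s) with the freshly fitted critic
        with torch.no_grad():
            adv = (r_t + (1.0 - d_t) * value_net(s_tp1).squeeze(-1)
                   - value_net(s_t).squeeze(-1))

        # 4 & 5: grad J ~ sum_i grad log pi(a_i|s_i) * A_i, then a gradient ascent step on J
        log_probs = Categorical(logits=policy_net(s_t)).log_prob(a_t)
        pi_loss = -(log_probs * adv).sum()       # minimizing -J ascends J
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Note that the advantage tensor is computed under `torch.no_grad()`, so the policy gradient step treats $\hat{A}^\pi$ as a fixed coefficient, matching step 4 above.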