From the last chapter, the policy gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right),$$

and the policy parameters are updated by $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. Here $\hat{Q}^\pi_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$ is the "reward to go", an estimate of the Q function. We know that $\hat{Q}^\pi_{i,t}$ estimates the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$, but it uses only a single trajectory for that estimate. Can we do better? In theory, the true expected reward-to-go is:
$$Q(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right].$$

Now, what about the baseline? In the policy gradient, we used the average Q value, i.e. $b_t = \frac{1}{N} \sum_i Q(s_{i,t}, a_{i,t})$. We can also use the value function $V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q(s_t, a_t) \right]$. So we have:
$$\begin{aligned}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q(s_{i,t}, a_{i,t}) - V(s_{i,t}) \right) \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A(s_{i,t}, a_{i,t}),
\end{aligned}$$

where $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ is the advantage. The better the estimate of the advantage $A(s_t, a_t)$, the lower the variance.
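To make this concrete, here is a minimal PyTorch-style sketch of the estimator above for a single trajectory: the Monte Carlo reward-to-go $\hat{Q}^\pi_{i,t}$, a value network as the baseline, and the advantage-weighted log-probability loss. The function names, the discrete `policy` network, and the `value_fn` baseline are assumptions for illustration, not code from this chapter.

```python
import torch


def rewards_to_go(rewards):
    """Monte Carlo reward-to-go: Q_hat[t] = sum_{t'=t}^{T} r[t'] for one trajectory."""
    q_hat = torch.zeros_like(rewards)
    running = torch.zeros(())
    for t in reversed(range(len(rewards))):
        running = rewards[t] + running
        q_hat[t] = running
    return q_hat


def policy_gradient_loss(policy, value_fn, states, actions, rewards):
    """Surrogate loss whose gradient is the advantage-weighted policy gradient:
    -mean_t [ log pi_theta(a_t | s_t) * A(s_t, a_t) ], with A = Q_hat - V(s_t)."""
    q_hat = rewards_to_go(rewards)                      # hat{Q}^pi_{i,t}
    values = value_fn(states).squeeze(-1)               # baseline V(s_{i,t})
    advantages = (q_hat - values).detach()              # A(s_{i,t}, a_{i,t}); no gradient through the baseline
    log_probs = torch.distributions.Categorical(
        logits=policy(states)).log_prob(actions)        # log pi_theta(a_{i,t} | s_{i,t})
    return -(log_probs * advantages).mean()
```

Calling `policy_gradient_loss(...).backward()` and taking an optimizer step corresponds to $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$; averaging the loss over $N$ sampled trajectories gives the $\frac{1}{N}\sum_i$ outer sum.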