Advantages

From the last chapter, the policy gradient is $\nabla_\theta J(\theta)\approx \frac{1}{N}\sum_i \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right) \right)$, and we then update the policy's parameters with $\theta \leftarrow \theta+\alpha \nabla_{\theta}J(\theta)$. Here $\hat{Q}_{i,t}^\pi=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$ is the Q function, or "reward-to-go". We know that $\hat{Q}_{i,t}^\pi$ is an estimate of the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$, but it uses only a single trajectory for the estimate. Can we do better? In theory, the true expected reward-to-go is:

$$Q(s_t,a_t)=\sum_{t'=t}^T \mathbb{E}\left[r(s_{t'},a_{t'})\mid s_t,a_t\right]$$
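To make the single-sample estimate concrete, here is a minimal NumPy sketch that computes $\hat{Q}_{i,t}^\pi$ for one sampled trajectory. The function name `reward_to_go` and the optional discount `gamma` are illustrative choices, not from the text; `gamma=1.0` matches the undiscounted sum above.

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Single-trajectory Monte Carlo estimate of Q_hat_t = sum_{t'=t}^T r(s_t', a_t').

    rewards: 1-D array of per-step rewards from one sampled trajectory.
    gamma:   optional discount; gamma=1.0 reproduces the plain sum above.
    """
    q_hat = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    # Accumulate backwards so each entry holds the sum of rewards from t to T.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

# Example: rewards [1, 0, 2, 1] give reward-to-go [4, 3, 3, 1].
print(reward_to_go(np.array([1.0, 0.0, 2.0, 1.0])))
```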

Now, what about the baseline? In the policy gradient, we used the average Q value, i.e. $b_t=\frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})$. We can also use the value function $V(s_t)=\mathbb{E}_{a_t\sim \pi_\theta(a_t|s_t)}[Q(s_t,a_t)]$. So we have:

$$\begin{aligned} \nabla_{\theta}J(\theta)&\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta (a_{i,t}|s_{i,t})\left(Q(s_{i,t},a_{i,t})-V(s_{i,t})\right)\\ &= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta (a_{i,t}|s_{i,t})A(s_{i,t},a_{i,t}) \end{aligned}$$

Here $A(s_t,a_t)=Q(s_t,a_t)-V(s_t)$ is the advantage function: the better the estimate of the advantage $A(s_t,a_t)$, the lower the variance of the gradient estimate.
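As a sketch of how this estimator is typically implemented, below is a minimal PyTorch surrogate loss whose gradient matches the expression above. The `policy` (returning a `torch.distributions.Categorical`) and `value_fn` (an approximate $V(s)$ network) interfaces are assumptions for illustration, not a specific library's API.

```python
import torch

def policy_gradient_loss(policy, value_fn, states, actions, q_hat):
    """Surrogate loss whose gradient is (1/N) * sum grad log pi(a|s) * A(s,a).

    policy:   assumed to map states -> torch.distributions.Categorical
    value_fn: assumed to map states -> V(s) estimates, shape [batch]
    states:   states visited, shape [batch, obs_dim]
    actions:  actions taken, shape [batch]
    q_hat:    reward-to-go estimates of Q(s, a), shape [batch]
    """
    log_probs = policy(states).log_prob(actions)              # log pi_theta(a_t | s_t)
    with torch.no_grad():
        advantages = q_hat - value_fn(states).squeeze(-1)     # A(s,a) = Q(s,a) - V(s)
    # Negative sign: minimizing this loss performs gradient ascent on J(theta).
    return -(log_probs * advantages).mean()
```

The advantage is computed under `torch.no_grad()` so that gradients flow only through $\log\pi_\theta$, matching the formula, where the advantage acts as a fixed per-sample weight.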
