From the last chapter, the policy gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right),$$

and the policy parameters are updated by $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. Here $\hat{Q}^\pi_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$ is the "reward to go", an estimate of the Q function. We know that $\hat{Q}^\pi_{i,t}$ estimates the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$, but it uses only a single trajectory for that estimate. Can we do better? In theory, the true expected reward-to-go is:
$$Q(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right].$$

Now, what about the baseline? In the policy gradient, we used the average Q value, i.e. $b_t = \frac{1}{N} \sum_i Q(s_{i,t}, a_{i,t})$. We can also use the value function $V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q(s_t, a_t) \right]$. So we have:
$$\begin{aligned}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q(s_{i,t}, a_{i,t}) - V(s_{i,t}) \right) \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A(s_{i,t}, a_{i,t}),
\end{aligned}$$

where $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ is the advantage. The better the estimate of the advantage $A(s_t, a_t)$, the lower the variance.
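To make this concrete, here is a minimal PyTorch-style sketch of the estimator above for a single trajectory: the Monte Carlo reward-to-go $\hat{Q}^\pi_{i,t}$, a value network as the baseline, and the advantage-weighted log-probability loss. The function names, the discrete `policy` network, and the `value_fn` baseline are assumptions for illustration, not code from this chapter.

```python
import torch


def rewards_to_go(rewards):
    """Monte Carlo reward-to-go: Q_hat[t] = sum_{t'=t}^{T} r[t'] for one trajectory."""
    q_hat = torch.zeros_like(rewards)
    running = torch.zeros(())
    for t in reversed(range(len(rewards))):
        running = rewards[t] + running
        q_hat[t] = running
    return q_hat


def policy_gradient_loss(policy, value_fn, states, actions, rewards):
    """Surrogate loss whose gradient is the advantage-weighted policy gradient:
    -mean_t [ log pi_theta(a_t | s_t) * A(s_t, a_t) ], with A = Q_hat - V(s_t)."""
    q_hat = rewards_to_go(rewards)                      # hat{Q}^pi_{i,t}
    values = value_fn(states).squeeze(-1)               # baseline V(s_{i,t})
    advantages = (q_hat - values).detach()              # A(s_{i,t}, a_{i,t}); no gradient through the baseline
    log_probs = torch.distributions.Categorical(
        logits=policy(states)).log_prob(actions)        # log pi_theta(a_{i,t} | s_{i,t})
    return -(log_probs * advantages).mean()
```

Calling `policy_gradient_loss(...).backward()` and taking an optimizer step corresponds to $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$; averaging the loss over $N$ sampled trajectories gives the $\frac{1}{N}\sum_i$ outer sum.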