Comparison to maximum likelihood
Policy gradient:
$$
\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)\right] = \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta \log \pi_\theta(\tau_i)\, r(\tau_i)
$$

Maximum likelihood:
$$
\nabla_\theta J_{\mathrm{ML}}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right) = \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta \log \pi_\theta(\tau_i)
$$

The only difference is that the policy gradient update carries a weight of $r(\tau)$, which means good stuff (trajectories with a high reward sum) is made more likely and bad stuff (trajectories with a low reward sum) is made less likely. To conclude, this algorithm simply formalizes the notion of "trial and error".
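To make the comparison concrete, here is a minimal sketch (assuming a discrete-action PyTorch policy network `policy` that maps observations to logits; the tensor names and shapes are illustrative, not from the original notes) showing that the policy gradient surrogate loss is just the maximum likelihood loss with each trajectory's log-probability weighted by its reward sum $r(\tau_i)$:

```python
import torch

def surrogate_losses(policy, obs, actions, rewards):
    """obs: [N, T, obs_dim], actions: [N, T], rewards: [N, T] for N sampled trajectories."""
    logits = policy(obs)                                  # [N, T, num_actions]
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                    # [N, T] = log pi_theta(a_{i,t} | s_{i,t})

    log_prob_tau = log_probs.sum(dim=1)                   # [N] = log pi_theta(tau_i)
    r_tau = rewards.sum(dim=1)                            # [N] = r(tau_i)

    # Maximum likelihood: every sampled trajectory is pushed to be more likely.
    ml_loss = -log_prob_tau.mean()
    # Policy gradient: each trajectory is weighted by its reward sum, so high-reward
    # trajectories are made more likely and low-reward trajectories less likely.
    pg_loss = -(log_prob_tau * r_tau).mean()
    return ml_loss, pg_loss
```

Calling `pg_loss.backward()` produces exactly the sample estimate of $\nabla_\theta J(\theta)$ above (up to sign, since optimizers minimize), while `ml_loss.backward()` produces the maximum likelihood gradient; the only difference in the code is the multiplication by `r_tau`.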