Intuition of PG

Comparison to maximum likelihood

Policy gradient:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}\mid s_{i,t}) \right) \left(\sum_{t=1}^T r(s_{i,t},a_{i,t}) \right) \right] = \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)\, r(\tau_i)$$
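
As a concrete illustration (not part of the original notes), here is a minimal PyTorch sketch of this estimator. The policy network, tensor shapes, and the use of a `Categorical` action distribution are assumptions made for the example.

```python
# A minimal sketch of the policy gradient estimator above, assuming a
# discrete-action policy and a batch of N already-sampled trajectories of length T.
import torch
from torch.distributions import Categorical

def pg_surrogate(policy, states, actions, rewards):
    """states:  [N, T, obs_dim]  (hypothetical shapes)
       actions: [N, T]           actions taken along each trajectory
       rewards: [N, T]           rewards received along each trajectory
    Returns a scalar whose gradient w.r.t. the policy parameters is the estimate
        (1/N) sum_i (sum_t grad log pi(a_{i,t}|s_{i,t})) * (sum_t r(s_{i,t}, a_{i,t})).
    """
    logits = policy(states)                               # [N, T, num_actions]
    logp = Categorical(logits=logits).log_prob(actions)   # [N, T]
    traj_logp = logp.sum(dim=1)                           # sum_t log pi(a_t|s_t), shape [N]
    traj_return = rewards.sum(dim=1)                      # sum_t r(s_t, a_t),     shape [N]
    # Returns are treated as constants: only the log-probabilities carry gradients.
    return (traj_logp * traj_return.detach()).mean()

# Gradient ascent on J(theta) = minimize the negated surrogate:
#   loss = -pg_surrogate(policy, states, actions, rewards); loss.backward()
```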

Maximum likelihood:

$$\nabla_\theta J_{ML}(\theta) = \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}\mid s_{i,t}) \right) = \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)$$

The only difference is that the policy gradient weights each trajectory's log-probability gradient by its total reward $r(\tau)$: trajectories with high total reward ("good stuff") are made more likely, while trajectories with low total reward ("bad stuff") are made less likely. In this sense, the algorithm simply formalizes the notion of "trial and error".
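
To make the comparison concrete in code (again a sketch, reusing the hypothetical `policy` and tensor shapes from above), the maximum likelihood objective is obtained simply by dropping the return weight:

```python
import torch
from torch.distributions import Categorical

def ml_surrogate(policy, states, actions):
    """Maximum likelihood objective: (1/N) sum_i sum_t log pi(a_{i,t}|s_{i,t})."""
    logits = policy(states)                               # [N, T, num_actions]
    logp = Categorical(logits=logits).log_prob(actions)   # [N, T]
    return logp.sum(dim=1).mean()

# pg_surrogate: (traj_logp * traj_return).mean()  -> reward-weighted log-likelihood
# ml_surrogate:  traj_logp.mean()                 -> plain log-likelihood
```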
