Intuition of PG

Comparison to maximum likelihood

Policy gradient:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \right) \left(\sum_{t=1}^T r(s_{i,t}, a_{i,t}) \right) \right] = \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)\, r(\tau_i)$$

Maximum likelihood:

$$\nabla_\theta J_{ML}(\theta) = \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \right) = \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)$$

The only difference is that the policy gradient update weights each trajectory's log-probability gradient by $r(\tau)$: trajectories with a high reward sum are made more likely, while trajectories with a low reward sum are made less likely. In other words, the algorithm simply formalizes the notion of "trial and error", as sketched in the code below.
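To make the comparison concrete, here is a minimal PyTorch-style sketch (the function names, tensor shapes, and random data are illustrative assumptions, not code from the course): the policy gradient surrogate loss is just the maximum-likelihood loss with each trajectory's log-probability weighted by its total reward.

```python
import torch

def maximum_likelihood_loss(log_probs):
    # log_probs: (N, T) values of log pi_theta(a_{i,t} | s_{i,t}) for N sampled trajectories.
    # Plain maximum likelihood: every observed action is made more likely, regardless of outcome.
    return -log_probs.sum(dim=1).mean()

def policy_gradient_loss(log_probs, rewards):
    # rewards: (N, T) rewards r(s_{i,t}, a_{i,t}) along each trajectory.
    # Weight each trajectory's summed log-probability by its total reward r(tau_i),
    # so high-reward trajectories are reinforced and low-reward ones are suppressed.
    returns = rewards.sum(dim=1)
    return -(log_probs.sum(dim=1) * returns).mean()

# Toy example with random data: N = 8 trajectories of length T = 10.
log_probs = torch.randn(8, 10, requires_grad=True)
rewards = torch.randn(8, 10)
policy_gradient_loss(log_probs, rewards).backward()
```

Differentiating `policy_gradient_loss` with respect to the policy parameters reproduces the reward-weighted gradient above, while `maximum_likelihood_loss` reproduces the unweighted one.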
