Advantages


From the last chapter, the policy gradient is

$$\nabla_\theta J(\theta)\approx \frac{1}{N}\sum_i \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right) \right),$$

and we then update the policy's parameters with $\theta \leftarrow \theta+\alpha \nabla_{\theta}J(\theta)$. Here $\hat{Q}_{i,t}^\pi=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$ is the Q function, or "reward to go". We know that $\hat{Q}_{i,t}^\pi$ estimates the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$, but it uses only a single trajectory for the estimate. Can we do better? In theory, the true expected reward-to-go is:

$$Q(s_t,a_t)=\sum_{t'=t}^T \mathbb{E}\left[r(s_{t'},a_{t'})\mid s_t,a_t\right]$$
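
For concreteness, here is a minimal sketch (a hypothetical NumPy helper, not code from the course) of the single-sample reward-to-go estimate $\hat{Q}_{i,t}$ computed from one sampled trajectory of rewards:

```python
import numpy as np

def reward_to_go(rewards):
    """Single-trajectory Monte Carlo estimate: Q_hat[t] = sum_{t'>=t} r[t']."""
    q_hat = np.zeros_like(rewards, dtype=float)
    running = 0.0
    # Accumulate the suffix sums from the last time step backwards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q_hat[t] = running
    return q_hat

# Example: rewards from one rollout of length T = 4.
print(reward_to_go(np.array([1.0, 0.0, 2.0, 1.0])))  # [4. 3. 3. 1.]
```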

Now, how about the baseline? In the policy gradient, we used the average Q value, i.e. $b_t=\frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})$. We can also use the value function $V(s_t)=\mathbb{E}_{a_t\sim \pi_\theta(a_t|s_t)}[Q(s_t,a_t)]$, which averages Q over the actions the policy would take. So we have:

$$\begin{aligned} \nabla_{\theta}J(\theta)&\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta (a_{i,t}|s_{i,t})\left(Q(s_{i,t},a_{i,t})-V(s_{i,t})\right)\\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta (a_{i,t}|s_{i,t})\,A(s_{i,t},a_{i,t}) \end{aligned}$$

The better the estimate of the advantage $A(s_t,a_t)=Q(s_t,a_t)-V(s_t)$, the lower the variance of the gradient estimate.
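
A minimal sketch of the resulting advantage-weighted update, assuming a PyTorch policy that exposes per-step log-probabilities and a separate critic producing the baseline values (the function and argument names here are illustrative, not from the course):

```python
import torch

def actor_critic_pg_loss(log_probs, q_hat, values):
    """Surrogate loss whose gradient matches the advantage-based estimator above.

    log_probs: log pi_theta(a_{i,t} | s_{i,t}), shape (N, T), differentiable w.r.t. theta.
    q_hat:     reward-to-go estimates Q_hat_{i,t}, shape (N, T).
    values:    baseline V(s_{i,t}) from the critic, shape (N, T).
    """
    # Advantage A(s, a) ~= Q_hat - V(s); detach so the baseline is treated as a
    # constant when differentiating with respect to the policy parameters.
    advantages = (q_hat - values).detach()
    # Negative sign: minimizing this loss performs gradient *ascent* on J(theta).
    return -(log_probs * advantages).mean()
```

Detaching the advantage matches the formula above: the baseline only rescales the weight on each $\log\pi_\theta$ term and does not receive gradients from this loss; the critic itself would be fit with a separate regression objective.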