Baselines

Critic as state-dependent baselines

Actor-critic

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t})+\gamma \hat{V}^\pi_\phi(s_{i,t+1})-\hat{V}^\pi_\phi(s_{i,t})\right)$$

+: lower variance (due to critic)

-: biased (if the critic is not perfect)
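A minimal sketch of this bootstrapped estimator in PyTorch, assuming hypothetical `policy` and `value_fn` modules (a policy returning a `torch.distributions` object and a state-value network); the names, shapes, and the omission of terminal-state handling are illustrative choices, not part of the course material.

```python
import torch

def actor_critic_pg_loss(policy, value_fn, s, a, r, s_next, gamma=0.99):
    """Surrogate loss whose gradient matches the actor-critic estimator above.

    Assumptions: policy(s) returns a torch.distributions object, value_fn(s)
    returns V_hat_phi(s) of shape [batch]; episode termination is ignored.
    """
    with torch.no_grad():
        # Bootstrapped advantage: r + gamma * V_hat(s') - V_hat(s)
        adv = r + gamma * value_fn(s_next).squeeze(-1) - value_fn(s).squeeze(-1)
    log_prob = policy(s).log_prob(a)     # log pi_theta(a|s)
    return -(log_prob * adv).mean()      # gradient descent on this ascends J
```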

Policy gradient

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)$$

+: no bias

-: higher variance (because it is a single-sample estimate)

Can we use $\hat{V}^\pi_\phi$ and still keep the estimator unbiased?

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}^\pi_\phi(s_{i,t})\right)$$

+: no bias

+: lower variance (baseline is closer to rewards)
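A sketch of this unbiased variant, again with hypothetical `policy` and `value_fn` modules and a single trajectory of rewards; relative to the plain policy gradient, the only change is replacing the constant baseline $b$ with `value_fn(states)`.

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """Single-sample estimate of sum_{t'>=t} gamma^(t'-t) r(s_t', a_t') per step.

    rewards: Python list of scalar rewards for one trajectory.
    """
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def mc_pg_loss(policy, value_fn, states, actions, rewards, gamma=0.99):
    """Monte Carlo policy gradient with V_hat_phi(s_t) as a state-dependent baseline."""
    returns = reward_to_go(rewards, gamma)            # reward-to-go (high variance)
    with torch.no_grad():
        baseline = value_fn(states).squeeze(-1)       # V_hat_phi(s_t), no gradient
    log_prob = policy(states).log_prob(actions)
    return -(log_prob * (returns - baseline)).mean()
```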

Control variates: action-dependent baselines

In theory

$$A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$$

Option 1:

$$\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-V^\pi_\phi(s_t)$$

+: no bias

-: higher variance (because it is a single-sample estimate)

Option 2:

$$\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-Q^\pi_\phi(s_t,a_t)$$

+: goes to zero in expectation if the critic is correct

-: not correct (an action-dependent baseline no longer integrates to zero under the policy, so the estimator is biased unless we add back a correction term)

Combination:

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\hat{Q}_{i,t}-Q^\pi_\phi(s_{i,t},a_{i,t})\right)+\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \mathbb{E}_{a\sim\pi_\theta(a|s_{i,t})}\left[Q^\pi_\phi(s_{i,t},a)\right]$$

The first term is the usual score-function estimator applied to the residual $\hat{Q}_{i,t}-Q^\pi_\phi(s_{i,t},a_{i,t})$, which has low variance when the critic is accurate; the second term adds back the expectation of the baseline so the overall estimator stays unbiased.
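A sketch of the combined estimator, assuming a hypothetical action-value critic `q_fn(s, a)` and a reparameterizable (e.g. Gaussian) policy; the correction term is estimated here by sampling fresh actions from $\pi_\theta$ via the reparameterization trick, although for some policy/critic classes it can be evaluated analytically. All module names are illustrative.

```python
import torch

def control_variate_pg_loss(policy, q_fn, states, actions, q_hat_mc, n_samples=4):
    """Policy-gradient surrogate with Q_phi^pi as an action-dependent control variate.

    q_hat_mc: single-sample reward-to-go estimates Q_hat_{i,t}.
    Only the policy parameters are meant to be updated with this loss.
    """
    dist = policy(states)

    # First term: score-function gradient on the residual Q_hat - Q_phi,
    # which shrinks toward zero as the critic improves (low variance).
    with torch.no_grad():
        residual = q_hat_mc - q_fn(states, actions).squeeze(-1)
    term1 = -(dist.log_prob(actions) * residual).mean()

    # Second term: grad_theta E_{a ~ pi_theta}[Q_phi(s, a)], estimated with
    # reparameterized samples so gradients flow through the sampled actions.
    q_samples = [q_fn(states, dist.rsample()).squeeze(-1) for _ in range(n_samples)]
    term2 = -torch.stack(q_samples).mean()

    return term1 + term2
```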