Other advantages

Eligibility traces & n-step returns

The learned value function $\hat{V}^\pi_\phi$ brings bias, while the Monte Carlo reward-sum brings variance.

Critic

$$\hat{A}^\pi_{C}=r(s_{t},a_{t})+\gamma \hat{V}^\pi_\phi(s_{t+1})-\hat{V}^\pi_\phi(s_{t})$$

+: lower variance

-: higher bias if value is wrong (it always is)

Monte Carlo

$$\hat{A}^\pi_{MC}=\sum_{t'=t}^\infty\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}^\pi_\phi(s_{t})$$

+: no bias

-: higher variance (because single-sample estimate)
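
To make the bias/variance comparison concrete, here is a minimal NumPy sketch (mine, not from the course) of the two estimators on a single finite trajectory. The `rewards`/`values` arrays, the function names, and the choice of a zero value after the final step are illustrative assumptions.

```python
import numpy as np

def critic_advantage(rewards, values, gamma):
    # one-step estimate: A_C(s_t,a_t) = r_t + gamma * V(s_{t+1}) - V(s_t);
    # the value after the final step is taken to be 0 (assumed episode end)
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values

def monte_carlo_advantage(rewards, values, gamma):
    # A_MC(s_t,a_t) = sum_{t'>=t} gamma^(t'-t) * r_{t'} - V(s_t)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # discounted reward-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values

# toy trajectory: one reward and one value estimate per time step
rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0])
print(critic_advantage(rewards, values, gamma=0.99))
print(monte_carlo_advantage(rewards, values, gamma=0.99))
```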

Can we combine these two to control the bias/variance tradeoff?

The contribution of future rewards decays with the discount factor $\gamma$, so we can cut the reward sum early and let the critic fill in the tail:

$$\hat{A}^\pi_n(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n \hat{V}^\pi_\phi(s_{t+n})-\hat{V}^\pi_\phi(s_{t})$$

Choosing $n>1$ often works better.
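
A sketch of the n-step estimator under the same illustrative assumptions as above. The reward sum runs over $n$ steps and the critic then takes over through the bootstrap term $\gamma^n\hat{V}^\pi_\phi(s_{t+n})$; $n=1$ reduces to the critic estimate, while large $n$ approaches the Monte Carlo estimate.

```python
import numpy as np

def n_step_advantage(rewards, values, gamma, n):
    # A_n(s_t,a_t) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t);
    # the sum is truncated at the end of the trajectory, and the value after the
    # final step is taken to be 0
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)
        ret = sum(gamma**k * rewards[t + k] for k in range(horizon))
        if t + n < T:                     # bootstrap with the critic if s_{t+n} was visited
            ret += gamma**n * values[t + n]
        adv[t] = ret - values[t]
    return adv

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0])
print(n_step_advantage(rewards, values, gamma=0.99, n=2))  # n=1 recovers the critic estimate
```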

Generalized advantage estimate

Do we have to choose just one $n$? We can cut everywhere, all at once.

Use a weighted combination of n-step returns, with weights that fall off exponentially in $n$:

$$w_n\propto \lambda^{n-1}$$

$$\begin{aligned}
\hat{A}^\pi_{GAE}(s_t,a_t)&=\sum_{n=1}^\infty w_n \hat{A}^\pi_n(s_t,a_t) \\
&=r(s_t,a_t)+\gamma\Big((1-\lambda)\hat{V}^\pi_\phi(s_{t+1})+\lambda\big(r(s_{t+1},a_{t+1})+\gamma\big((1-\lambda)\hat{V}^\pi_\phi(s_{t+2})+\lambda r(s_{t+2},a_{t+2})+\cdots\big)\big)\Big) \\
&=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\big(r(s_{t'},a_{t'})+\gamma\hat{V}^\pi_\phi(s_{t'+1})-\hat{V}^\pi_\phi(s_{t'})\big) \\
&=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}
\end{aligned}$$

where $\delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}^\pi_\phi(s_{t'+1})-\hat{V}^\pi_\phi(s_{t'})$ is the TD error. The product $\gamma\lambda$ acts like a discount: $\lambda\to 0$ recovers the low-variance critic estimate, $\lambda\to 1$ the unbiased Monte Carlo estimate.
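
The last line suggests computing the estimate with the backward recursion $\hat{A}_t=\delta_t+\gamma\lambda\hat{A}_{t+1}$. Below is a minimal NumPy sketch of that recursion, again under the illustrative assumption of a finite trajectory with zero value after the final step.

```python
import numpy as np

def gae_advantage(rewards, values, gamma, lam):
    # A_GAE(s_t,a_t) = sum_{t'>=t} (gamma*lam)^(t'-t) * delta_{t'},
    # computed with the backward recursion A_t = delta_t + gamma*lam*A_{t+1};
    # lam -> 0 gives the one-step critic estimate, lam -> 1 the Monte Carlo estimate
    next_values = np.append(values[1:], 0.0)           # value after the final step taken as 0
    deltas = rewards + gamma * next_values - values    # TD errors delta_t
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0])
print(gae_advantage(rewards, values, gamma=0.99, lam=0.95))
```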
[Figures: discount reward; n-step returns]