Discount factors


Infinite cases

So far we have only discussed episodic tasks, but what about continuing (cyclical) tasks? What if $T$ is $\infty$? In many cases, $\hat{V}^\pi_\phi$ can become infinitely large. A simple trick solves this problem: make it better to get rewards sooner rather than later.

Introducing a discount factor $\gamma$ is equivalent to a new MDP in which the agent "dies" (enters an absorbing state with no reward) with probability $1-\gamma$ at every step:

(figure: the new MDP with an added death state)

So the new target for the critic is

$$y_{i,t}\approx r(s_{i,t},a_{i,t})+\gamma \hat{V}^\pi_\phi(s_{i,t+1}),$$

where the discount factor $\gamma \in [0,1]$ (0.99 works well).
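As a concrete illustration, here is a minimal NumPy sketch of this bootstrapped target for a batch of transitions. The function name and the `dones` mask for terminal states are our own assumptions, not part of the notes.

```python
import numpy as np

def bootstrapped_targets(rewards, next_values, dones, gamma=0.99):
    """y_t = r(s_t, a_t) + gamma * V_hat(s_{t+1}), with no bootstrap at terminal steps."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)  # 1.0 where the episode ended, else 0.0
    return rewards + gamma * (1.0 - dones) * next_values
```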

Discount factors for policy gradient

In Monte Carlo policy gradients, we have two options for where to apply the discount:

option 1 (discount the reward to go, starting from the current step):

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\right)$$

option 2 (discount the full trajectory return, starting from the first step):

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T \gamma^{t-1} r(s_{i,t},a_{i,t})\right)$$

Applying causality to option 2:

$$\begin{aligned} \nabla_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-1} r(s_{i,t'},a_{i,t'})\right)\\ &= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\gamma^{t-1}\,\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\right) \end{aligned}$$

Option 1 only changes the "reward to go": the discount starts at the current state, so the gradient at each step is not itself down-weighted. Option 2 discounts from the beginning of the episode; after applying causality, the gradient at step $t$ carries an extra factor $\gamma^{t-1}$. Option 2 is the mathematically correct gradient for the "death" MDP above, in which the robot dies with probability $1-\gamma$ at every step, so later time steps matter exponentially less. But option 1 is what we actually use: we only introduced the discount factor to keep the sums finite in the continuing case, and we still want to approximate the undiscounted average reward. Future rewards are simply more uncertain, so their contribution is reduced gradually, but we do not want to shrink the policy gradient at later time steps themselves.
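To make the difference concrete, here is a small NumPy sketch (the helper name is ours) that computes, for one sampled trajectory, the scalar weight multiplying $\nabla_\theta \log \pi_\theta(a_t|s_t)$ at each step under each option; option 2's extra $\gamma^{t-1}$ factor is what suppresses the gradient at later steps.

```python
import numpy as np

def pg_weights(rewards, gamma=0.99):
    """Per-step weights on grad log pi(a_t|s_t) for a single trajectory.

    option 1: discounted reward to go, with the discount restarting at step t.
    option 2 (after causality): the same reward to go, additionally scaled by gamma^(t-1).
    """
    r = np.asarray(rewards, dtype=np.float64)
    T = len(r)
    # discounted reward to go: sum_{t' >= t} gamma^(t'-t) * r_{t'}
    to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = r[t] + gamma * running
        to_go[t] = running
    option1 = to_go
    # code index t is 0-based, so gamma**t corresponds to gamma^(t-1) in the 1-indexed math
    option2 = (gamma ** np.arange(T)) * to_go
    return option1, option2
```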

Actor-critic algorithms (with discount)

With the discounted critic, the actor-critic policy gradient becomes

$$\begin{aligned} \nabla_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t})+\gamma \hat{V}^\pi_\phi(s_{i,t+1})-\hat{V}^\pi_\phi(s_{i,t}) \right) \\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, \hat{A}^\pi_\phi(s_{i,t},a_{i,t}) \end{aligned}$$

batch version

batch actor-critic algorithm:

repeat until convergence:

1. sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)
2. fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (use the bootstrapped estimate as the target)
3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$
4. $\nabla_\theta J(\theta)\approx \sum_i \nabla_\theta \log \pi_\theta (a_i|s_i)\hat{A}^\pi(s_i,a_i)$
5. $\theta \leftarrow \theta+ \alpha\nabla_\theta J(\theta)$
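Below is a hedged PyTorch sketch of one iteration of steps 2-5 above (step 1, collecting the batch, is omitted). The interfaces are assumptions rather than anything fixed by the notes: `policy(states)` returns a `torch.distributions` object, `value_fn(states)` returns a `(batch,)` tensor, and `rewards`/`dones` are `(batch,)` float tensors.

```python
import torch

def batch_actor_critic_step(policy, value_fn, policy_opt, value_opt,
                            states, actions, rewards, next_states, dones, gamma=0.99):
    """One iteration of steps 2-5 of the batch actor-critic algorithm (assumed interfaces)."""
    # step 2: fit V_hat to the bootstrapped target y = r + gamma * V_hat(s')
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * value_fn(next_states)
    value_loss = torch.mean((value_fn(states) - targets) ** 2)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # step 3: advantage estimate A_hat = r + gamma * V_hat(s') - V_hat(s)
    with torch.no_grad():
        advantages = rewards + gamma * (1.0 - dones) * value_fn(next_states) - value_fn(states)

    # steps 4-5: gradient ascent on sum_i log pi(a_i|s_i) * A_hat_i
    log_probs = policy(states).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```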

online version

online actor-critic algorithm:

repeat until convergence:

1. take action $a\sim \pi_\theta(a|s)$, get $(s,a,s',r)$
2. update $\hat{V}^\pi_\phi$ using the target $r+\gamma\hat{V}^\pi_\phi(s')$
3. evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}^\pi_\phi(s')-\hat{V}^\pi_\phi(s)$
4. $\nabla_\theta J(\theta)\approx \nabla_\theta \log \pi_\theta (a|s)\hat{A}^\pi(s,a)$
5. $\theta \leftarrow \theta+ \alpha\nabla_\theta J(\theta)$
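A matching sketch of a single online update from one transition $(s,a,s',r)$, under the same assumed interfaces as the batch sketch; this raw single-sample version has high variance, and in practice one usually batches several such updates together.

```python
import torch

def online_actor_critic_step(policy, value_fn, policy_opt, value_opt,
                             s, a, r, s_next, done, gamma=0.99):
    """One online update from a single transition, following steps 2-5 (assumed interfaces)."""
    not_done = 1.0 - float(done)

    # step 2: push V_hat(s) toward the target r + gamma * V_hat(s')
    with torch.no_grad():
        target = r + gamma * not_done * value_fn(s_next)
    value_loss = ((value_fn(s) - target) ** 2).sum()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # step 3: advantage estimate A_hat = r + gamma * V_hat(s') - V_hat(s)
    with torch.no_grad():
        advantage = r + gamma * not_done * value_fn(s_next) - value_fn(s)

    # steps 4-5: single-sample policy gradient ascent step
    policy_loss = -(policy(s).log_prob(a) * advantage).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```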
