
Policy evaluation


Fit what?

So now we need to fit one of $Q^\pi$, $V^\pi$, or $A^\pi$. The question is: fit what to what? We know the definitions:

$$
\begin{aligned}
Q^\pi(s_t,a_t) &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'}) \mid s_t,a_t\right] \\
V^\pi(s_t) &= \mathbb{E}_{a_t\sim \pi_\theta(a_t|s_t)}\left[Q^\pi(s_t,a_t)\right] \\
A^\pi(s_t,a_t) &= Q^\pi(s_t,a_t) - V^\pi(s_t)
\end{aligned}
$$

First, both $Q^\pi$ and $A^\pi$ can be calculated or approximated from $V^\pi$:

$$
\begin{aligned}
Q^\pi(s_t,a_t) &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'}) \mid s_t,a_t\right] \\
&\approx r(s_t,a_t) + \sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'}) \mid s_t,a_t\right] \\
&\approx r(s_t,a_t) + V^\pi(s_{t+1}) \\
A^\pi(s_t,a_t) &\approx r(s_t,a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)
\end{aligned}
$$

Apart from this, both $Q^\pi$ and $A^\pi$ take two inputs $s_t, a_t$, while $V^\pi$ only needs $s_t$, which may be easier to fit.

So let's just fit $V^\pi$:

$$
V^\pi(s_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'}) \mid s_t\right]
$$

Fitted to what?

Fitting $V^\pi(s_t)$ evaluates how good the policy is, which is called policy evaluation. As the policy gradient does, we can use Monte Carlo policy evaluation:

$$
V^\pi(s_t) \approx \sum_{t'=t}^T r(s_{t'},a_{t'})
$$

If we are able to reset the simulator, we can take more than one sample:

$$
V^\pi(s_t) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t'=t}^T r(s_{t'},a_{t'})
$$

But the former single-sample estimate is still pretty good, so the supervised learning setup is:

training data: $\left\{\left(s_{i,t}, \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\right\}$; define the target $y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$.

supervised regression: $\mathcal{L}(\phi) = \frac{1}{2}\sum_i \left\|\hat{V}_\phi^\pi(s_i) - y_i\right\|^2$
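As an illustration (my own sketch, not code from the course), the following fits a value network $\hat{V}_\phi^\pi$ by regression on Monte Carlo reward-to-go targets; the `reward_to_go` and `fit_value_function` helpers, the network size, the 4-dimensional state, and the training hyperparameters are all assumptions.

```python
# Minimal sketch (assumptions: trajectories stored as (states, rewards) lists,
# a small MLP value network, 4-dimensional states). Not from the original notes.
import torch
import torch.nn as nn

def reward_to_go(rewards):
    """Monte Carlo targets y_t = sum_{t'=t}^T r_{t'} for one trajectory."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running += r
        targets.append(running)
    return targets[::-1]

def fit_value_function(value_net, trajectories, epochs=50, lr=1e-3):
    """Supervised regression: minimize 1/2 * ||V_phi(s_i) - y_i||^2 (mean over the batch)."""
    states, targets = [], []
    for traj_states, traj_rewards in trajectories:
        states.extend(traj_states)
        targets.extend(reward_to_go(traj_rewards))
    s = torch.as_tensor(states, dtype=torch.float32)
    y = torch.as_tensor(targets, dtype=torch.float32)
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = 0.5 * (value_net(s).squeeze(-1) - y).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Example value network; the state dimension of 4 is an arbitrary assumption.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
```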

Bootstrapped estimate

The Monte Carlo target $y_{i,t}$ is not perfect. Can we do better? The ideal target is:

$$
\begin{aligned}
y_{i,t} &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{i,t'},a_{i,t'}) \mid s_{i,t}\right] \\
&\approx r(s_{i,t},a_{i,t}) + \sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}\left[r(s_{i,t'},a_{i,t'}) \mid s_{i,t}\right] \\
&\approx r(s_{i,t},a_{i,t}) + V^\pi(s_{i,t+1}) \\
&\approx r(s_{i,t},a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})
\end{aligned}
$$

This is the bootstrapped estimate: sample one step, then directly use the previously fitted value function for the rest. This estimate has more bias but lower variance.
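For concreteness, here is a small sketch (again my own, under the same assumptions as the earlier snippet) of how the bootstrapped targets $y_{i,t} \approx r(s_{i,t},a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$ could be computed from a batch of transitions:

```python
import torch

def bootstrapped_targets(value_net, rewards, next_states, dones):
    """y_i = r(s_i, a_i) + V_phi(s_i'), treating V(s') as 0 at terminal states."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s_next = torch.as_tensor(next_states, dtype=torch.float32)
    done = torch.as_tensor(dones, dtype=torch.float32)
    with torch.no_grad():  # targets are held fixed while fitting V_phi
        v_next = value_net(s_next).squeeze(-1)
    return r + (1.0 - done) * v_next
```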

Algorithm

Batch actor-critic algorithm (a minimal code sketch of the whole loop is given after the steps):

repeat until convergence:

1. sample $\{s_i, a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)
2. fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (or use the bootstrapped estimate as the target)
3. evaluate $\hat{A}^\pi(s_i,a_i) = r(s_i,a_i) + \hat{V}^\pi_\phi(s_i') - \hat{V}^\pi_\phi(s_i)$
4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i|s_i)\,\hat{A}^\pi(s_i,a_i)$
5. $\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)$
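A minimal end-to-end sketch of this loop, assuming a discrete-action environment with the classic Gym interface (`env.reset() -> s`, `env.step(a) -> (s', r, done, info)`); the networks, dimensions, and hyperparameters are illustrative assumptions, not the course's reference implementation.

```python
# Illustrative batch actor-critic loop (not the course's code). Assumes a classic
# Gym-style interface: env.reset() -> s, env.step(a) -> (s', r, done, info),
# discrete actions, and small MLPs for the actor and the critic.
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2  # assumed dimensions (e.g. a CartPole-like task)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def train(env, iterations=100, episodes_per_batch=10, v_epochs=20):
    for _ in range(iterations):
        # 1: sample {s_i, a_i} from pi_theta(a|s)
        S, A, R, S2, D = [], [], [], [], []
        for _ in range(episodes_per_batch):
            s, done = env.reset(), False
            while not done:
                dist = Categorical(logits=policy(torch.as_tensor(s, dtype=torch.float32)))
                a = dist.sample().item()
                s2, r, done, _ = env.step(a)
                S.append(s); A.append(a); R.append(r); S2.append(s2); D.append(float(done))
                s = s2
        s_t = torch.as_tensor(S, dtype=torch.float32)
        a_t = torch.as_tensor(A)
        r_t = torch.as_tensor(R, dtype=torch.float32)
        s2_t = torch.as_tensor(S2, dtype=torch.float32)
        d_t = torch.as_tensor(D, dtype=torch.float32)

        # 2: fit V_phi(s) using the bootstrapped target y = r + V_phi(s')
        for _ in range(v_epochs):
            with torch.no_grad():
                y = r_t + (1.0 - d_t) * value_net(s2_t).squeeze(-1)
            v_loss = 0.5 * (value_net(s_t).squeeze(-1) - y).pow(2).mean()
            v_opt.zero_grad(); v_loss.backward(); v_opt.step()

        # 3: evaluate A_hat(s_i, a_i) = r + V_phi(s') - V_phi(s)
        with torch.no_grad():
            adv = r_t + (1.0 - d_t) * value_net(s2_t).squeeze(-1) - value_net(s_t).squeeze(-1)

        # 4-5: grad J ~ sum_i grad log pi(a_i|s_i) * A_hat, then a gradient ascent step
        log_prob = Categorical(logits=policy(s_t)).log_prob(a_t)
        pi_loss = -(log_prob * adv).mean()  # minimizing the negative performs ascent on J
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```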
