Reduce variance

What's wrong with the policy gradient?

We know that the policy gradient estimator has high variance, and there are many reasons for this. Here is the most straightforward one.

Suppose we draw three samples: one with a large negative reward and two with small positive rewards. According to the update formula for $\theta$, the distribution over $\tau$ shifts a lot toward the two good samples and away from the bad one. However, if we add a constant to every reward, so that we instead have one small positive reward and two large positive rewards, the same update only shifts the distribution a little.

This example shows that a slight change to the rewards can severely change the sample estimate of $\nabla_\theta J(\theta)$, which is exactly what high variance means.

The worst case is when the two good samples have $r(\tau)=0$: they then contribute nothing to the gradient, and training may take a long time to converge or end up in a sub-optimal solution.
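
To make this concrete, here is a minimal numerical sketch (not from the original notes). It assumes a toy one-dimensional Gaussian "policy" over trajectories so that the score function has a closed form, and shows that adding a constant to all rewards changes the finite-sample gradient estimate, even though it does not change the expected gradient.

```python
import numpy as np

# Toy setup (an assumption for illustration): treat each "trajectory" as a
# single scalar sampled from a Gaussian policy N(mu, sigma^2), whose score
# function is grad_mu log pi(tau) = (tau - mu) / sigma^2.
mu, sigma = 0.0, 1.0

def score(tau):
    return (tau - mu) / sigma**2

# Three sampled trajectories: one very bad, two mildly good.
taus    = np.array([-1.0, 0.5, 1.0])
rewards = np.array([-10.0, 1.0, 2.0])

def pg_estimate(taus, rewards):
    # (1/N) * sum_i grad log pi(tau_i) * r(tau_i)
    return np.mean(score(taus) * rewards)

print(pg_estimate(taus, rewards))         # estimate with the raw rewards
print(pg_estimate(taus, rewards + 11.0))  # same samples, rewards shifted by a constant
# The two estimates differ noticeably, even though adding a constant to every
# reward leaves the expected gradient unchanged -- that gap is the variance.
```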

Causality

$$
\nabla_\theta J(\theta) =\frac{1}{N}\sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t}) \right) \left(\sum_{t=1}^T r(s_{i,t},a_{i,t}) \right) \right]
$$

Causality means that the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$. So the gradient can be rewritten as

$$
\nabla_\theta J(\theta) =\frac{1}{N}\sum_{i=1}^N \left[ \sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) \right) \right] =\frac{1}{N}\sum_{i=1}^N \left[ \sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t}) \,\hat{Q}_{i,t} \right]
$$

Here $\hat{Q}_{i,t}$ is the reward-to-go from time $t$ for sample $i$. Since the inner sum now starts at $t'=t$ instead of $t'=1$, fewer reward terms are summed, which leads to lower variance.

The causality trick never hurts, so it can be used every time.
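
A minimal sketch of the reward-to-go computation (the helper name `rewards_to_go` is mine); a reversed cumulative sum gives all the suffix sums at once.

```python
import numpy as np

def rewards_to_go(rewards):
    """Q_hat_t = sum_{t'=t}^{T} r_{t'} for one sampled trajectory.

    `rewards` is a 1-D array of per-step rewards r(s_t, a_t).
    A reversed cumulative sum computes every suffix sum in O(T).
    """
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

# Example: r = [1, 2, 3]  ->  Q_hat = [6, 5, 3]
print(rewards_to_go([1.0, 2.0, 3.0]))
```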

Baseline

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)\,r(\tau_i)
$$

Our goal is to make good trajectories more probable and bad trajectories less probable. The problem is that good trajectories don't always have large positive rewards, and bad trajectories don't always have negative rewards. The fix is to subtract a baseline:

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)\,(r(\tau_i)-b)
$$

where $b=\frac{1}{N}\sum_{i=1}^N r(\tau_i)$.
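
As a rough sketch of this baseline-subtracted estimator (function and argument names are mine; it assumes the per-trajectory gradients $\nabla_\theta \log \pi_\theta(\tau_i)$ have already been computed and stacked into an array):

```python
import numpy as np

def pg_with_average_baseline(grad_log_probs, returns):
    """Baseline-subtracted policy gradient estimate.

    grad_log_probs: array of shape (N, D) -- grad_theta log pi_theta(tau_i)
                    for each of the N sampled trajectories.
    returns:        array of shape (N,)   -- r(tau_i) for each trajectory.
    """
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    returns = np.asarray(returns, dtype=float)
    b = returns.mean()  # b = (1/N) sum_i r(tau_i)
    # (1/N) sum_i grad log pi(tau_i) * (r(tau_i) - b)
    return (grad_log_probs * (returns - b)[:, None]).mean(axis=0)
```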

In fact, subtracting a baseline keeps the gradient estimator unbiased, because the baseline term vanishes in expectation:

$$
E[\nabla_\theta\log\pi_\theta(\tau)\,b]=\int \pi_\theta(\tau)\,\nabla_\theta\log\pi_\theta(\tau)\,b\,d\tau=\int \nabla_\theta \pi_\theta(\tau)\, b\,d\tau=b\,\nabla_\theta\int \pi_\theta(\tau)\,d\tau=b\,\nabla_\theta 1=0
$$
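
A quick Monte Carlo sanity check of this identity, again under the toy Gaussian assumption from the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D Gaussian "policy over trajectories" (an assumption for illustration):
# the score function of N(mu, 1) with respect to mu is (tau - mu).
mu, b = 0.0, 5.0
taus = rng.normal(loc=mu, scale=1.0, size=1_000_000)

# E[grad log pi(tau) * b] should be zero for any constant baseline b.
print(np.mean((taus - mu) * b))  # close to 0, up to Monte Carlo noise
```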

The last thing worth mentioning is that the average reward is not the best baseline, but it's pretty good.

Analyzing variance

Subtracting a baseline is unbiased, but can we find the best baseline, the one with the lowest variance?

$$
\nabla_\theta J(\theta)=E_{\tau\sim \pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b)]
$$

And the variance is

$$
\begin{aligned}
\text{Var} &= E[x^2]-E[x]^2 \\
&=E_{\tau\sim \pi_\theta(\tau)}[(\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b))^2]-E_{\tau\sim \pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b)]^2 \\
&=E_{\tau\sim \pi_\theta(\tau)}[(\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b))^2]-E_{\tau\sim \pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta (\tau)\,r(\tau)]^2
\end{aligned}
$$

The last step uses the fact that the baseline is unbiased in expectation, so the mean term is the same with or without $b$.

Denote $g(\tau)=\nabla_\theta\log\pi_\theta (\tau)$, and minimize the variance by setting its derivative with respect to $b$ to zero:

$$
\begin{aligned}
\frac{d\,\text{Var}}{db} &=\frac{d}{db}E[g(\tau)^2(r(\tau)-b)^2] \\
&=\frac{d}{db}\left(E[g(\tau)^2 r(\tau)^2]-2\,b\,E[g(\tau)^2 r(\tau)]+b^2 E[g(\tau)^2] \right) \\
&=\frac{d}{db}\left(-2\,b\,E[g(\tau)^2 r(\tau)]+b^2 E[g(\tau)^2] \right) \\
&=-2\,E[g(\tau)^2 r(\tau)]+2\,b\,E[g(\tau)^2] =0
\end{aligned}
$$

So $b$ should be

$$
b=\frac{E[g(\tau)^2 r(\tau)]}{E[g(\tau)^2]}
$$

This is just expected reward, but weighted by gradient magnitudes.

In theory, the best baseline is this gradient-weighted expected reward, but in practice it makes little noticeable difference compared with the plain average reward. Since the average is cheaper to compute, the average reward is usually used as the baseline.
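
A rough sketch of this optimal baseline, reusing the shapes assumed in the earlier sketch (names are mine; the expectations are replaced by sample means, the baseline is computed per gradient dimension, and the small constant only guards against division by zero):

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns):
    """Variance-minimizing baseline b = E[g(tau)^2 r(tau)] / E[g(tau)^2].

    grad_log_probs: (N, D) array of grad_theta log pi_theta(tau_i).
    returns:        (N,) array of r(tau_i).
    Returns one baseline value per gradient dimension, shape (D,).
    """
    g2 = np.asarray(grad_log_probs, dtype=float) ** 2      # (N, D)
    r = np.asarray(returns, dtype=float)[:, None]           # (N, 1)
    return (g2 * r).mean(axis=0) / (g2.mean(axis=0) + 1e-8)

# In practice this per-dimension baseline and the plain average reward
# returns.mean() behave about the same, so the cheaper average is used.
```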
