Baselines

Critic as state-dependent baselines

Actor-critic

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t})+\gamma \hat{V}^\pi_\phi(s_{i,t+1})-\hat{V}^\pi_\phi(s_{i,t}) \right)
$$

+: lower variance (due to critic)

-: biased (if the critic is not perfect)
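
The term in parentheses above is a one-step temporal-difference advantage estimate. A minimal sketch of how it might be computed for one trajectory follows; the function name `td_advantages` and the convention that `values` carries one extra entry for the state after the last step are assumptions made for illustration, not part of the notes.

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step TD advantages: r_t + gamma * V(s_{t+1}) - V(s_t).

    Assumption: `values` holds critic estimates for s_1 .. s_{T+1}, so it has
    one more entry than `rewards`. Use 0.0 as the last entry if the episode
    terminated at step T.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


# Toy trajectory with made-up numbers.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]
advantages = td_advantages(rewards, values, gamma=0.9)
print(advantages)
```

Each advantage depends on only one sampled reward, which is where the lower variance comes from; any error in `values` carries straight into the estimate, which is the source of the bias.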

Policy gradient

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)
$$

+: no bias

-: higher variance (because single-sample estimate)
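
The inner sum over $t'$ is the discounted reward-to-go of a single sampled trajectory. A short sketch of computing it with a backward pass is shown here; the names `rewards_to_go` and `pg_weights` are hypothetical, and the constant baseline `b` is passed in by the caller.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: sum_{t'>=t} gamma^(t'-t) * r_{t'} for every t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already built for t + 1.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def pg_weights(rewards, gamma=0.99, b=0.0):
    """Weights multiplying grad log pi(a_t | s_t): reward-to-go minus baseline b."""
    return [g - b for g in rewards_to_go(rewards, gamma)]


print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
print(pg_weights([1.0, 1.0, 1.0], gamma=0.5, b=1.0))  # [0.75, 0.5, 0.0]
```

Every weight is built from one rollout only, so it is an unbiased but noisy estimate of the expected return from that time step.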

Can we use $\hat{V}^\pi_\phi$ and still keep the estimate unbiased?

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}^\pi_\phi(s_{i,t})\right)
$$

+: no bias

+: lower variance (the baseline is closer to the reward-to-go it is subtracted from)
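
The reason a state-dependent baseline adds no bias can be sketched in one line. Here $b(s_t)$ stands for any function of the state only, for example $\hat{V}^\pi_\phi(s_t)$; the derivation assumes that the gradient and the integral over actions can be exchanged.

$$
\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,b(s_t)\right]
= b(s_t)\int \nabla_\theta \pi_\theta(a_t|s_t)\,da_t
= b(s_t)\,\nabla_\theta \int \pi_\theta(a_t|s_t)\,da_t
= b(s_t)\,\nabla_\theta 1 = 0
$$

The step that pulls $b(s_t)$ out of the integral only works because the baseline does not depend on $a_t$.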

Control variates: action-dependent baselines

In theory

$$
A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)
$$

Option 1:

$$
\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-V^\pi_\phi(s_t)
$$

+: no bias

-: higher variance (because single-sample estimate)

Option 2:

$$
\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-Q^\pi_\phi(s_t,a_t)
$$

+: goes to zero in expectation if the critic is correct

-: not correct (an action-dependent baseline changes the expected gradient, so a correction term is needed)

Combination:

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\hat{Q}_{i,t}- Q^\pi_\phi(s_{i,t},a_{i,t})\right) + \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \mathbb{E}_{a_t\sim\pi_\theta(a_t|s_{i,t})}\left[Q^\pi_\phi(s_{i,t},a_t) \right]
$$
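
As a sanity check on the combined estimator, the sketch below works through a toy case: one state, three actions, and a softmax policy with logits `theta`. All numbers and function names are made up for illustration, and the sampled return $\hat{Q}_{i,t}$ is replaced by its conditional mean `q_true[a]`, so the expectation over actions can be summed exactly instead of sampled.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(pi, a):
    # For a softmax policy: d/d theta_k log pi(a) = 1[k == a] - pi_k.
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(pi))]


def grad_expected_q(pi, q):
    # d/d theta_k sum_a pi(a) q(a) = pi_k * (q_k - sum_a pi_a q_a).
    mean_q = sum(p * x for p, x in zip(pi, q))
    return [pi[k] * (q[k] - mean_q) for k in range(len(pi))]


def expected_estimator(pi, q_true, q_critic, use_correction):
    """Exact expectation over a ~ pi of the action-dependent-baseline estimator."""
    n = len(pi)
    total = [0.0] * n
    for a in range(n):
        g = grad_log_pi(pi, a)
        for k in range(n):
            total[k] += pi[a] * g[k] * (q_true[a] - q_critic[a])
    if use_correction:
        # Second term of the combination: gradient of E_a[Q_phi(s, a)].
        corr = grad_expected_q(pi, q_critic)
        total = [t + c for t, c in zip(total, corr)]
    return total


theta = [0.2, -0.5, 1.0]
q_true = [1.0, 3.0, -2.0]
q_critic = [0.0, 5.0, -1.0]  # deliberately wrong critic

pi = softmax(theta)
true_grad = grad_expected_q(pi, q_true)
with_corr = expected_estimator(pi, q_true, q_critic, use_correction=True)
without_corr = expected_estimator(pi, q_true, q_critic, use_correction=False)

print("true gradient:      ", true_grad)
print("with correction:    ", with_corr)
print("without correction: ", without_corr)
```

In this toy setting the corrected estimator matches the true gradient even though `q_critic` is wrong, while dropping the second term shifts the result by the gradient of the critic's expected value.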
