Other advantage estimators

Eligibility traces & n-step returns

The learned value function $\hat{V}^\pi_\phi$ introduces bias, while the single-sample reward sum introduces variance.

Critic

$$\hat{A}^\pi_{C} = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$

+: lower variance

-: higher bias if the value estimate is wrong (it always is)
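For concreteness, here is a minimal NumPy sketch of this one-step estimator. The trajectory layout (rewards of length $T$, critic values of length $T+1$ with $\hat{V}(s_T) = 0$ at termination) is an assumed convention, not something fixed by the notes:

```python
import numpy as np

def one_step_advantage(rewards, values, gamma=0.99):
    """One-step (critic) advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)  # r(s_t, a_t), shape (T,)
    values = np.asarray(values, dtype=np.float64)    # V(s_0)..V(s_T), shape (T+1,)
    return rewards + gamma * values[1:] - values[:-1]
```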

Monte Carlo

$$\hat{A}^\pi_{MC} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$$

+: no bias

-: higher variance (because it is a single-sample estimate)
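A matching sketch for the Monte Carlo estimator, under the same assumed trajectory layout; the discounted reward-to-go is accumulated in a single backward pass:

```python
import numpy as np

def mc_advantage(rewards, values, gamma=0.99):
    """Monte Carlo advantage: discounted reward-to-go minus the baseline V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)  # shape (T,)
    values = np.asarray(values, dtype=np.float64)    # shape (T+1,)
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # accumulate reward-to-go backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values[:T]           # subtract the critic baseline
```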

Can we combine these two to control the bias/variance tradeoff?

Because of the discount factor $\gamma$, rewards far in the future contribute less and less, so we can cut the sum off early and let the critic account for the tail.

$$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^n \hat{V}^\pi_\phi(s_{t+n}) - \hat{V}^\pi_\phi(s_t)$$

Choosing $n > 1$ often works better than the pure one-step estimator.
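A sketch of the $n$-step estimator under the same assumed layout; truncating the window when it runs past the end of the trajectory is an assumption for finite rollouts:

```python
import numpy as np

def n_step_advantage(rewards, values, gamma=0.99, n=5):
    """n-step advantage: n discounted rewards, then bootstrap with the critic."""
    rewards = np.asarray(rewards, dtype=np.float64)  # shape (T,)
    values = np.asarray(values, dtype=np.float64)    # shape (T+1,), values[T] = 0 if terminal
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)           # shrink the window near the end
        ret = sum(gamma**k * rewards[t + k] for k in range(horizon))
        ret += gamma**horizon * values[t + horizon]   # gamma^n * V(s_{t+n})
        adv[t] = ret - values[t]
    return adv
```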

Generalized advantage estimation

Do we have to choose just one $n$? Instead of committing to a single cutoff, we can cut everywhere all at once by averaging over all $n$.

Use a weighted combination of $n$-step returns with exponentially decaying weights:

$$w_n \propto \lambda^{n-1}$$

Normalizing gives $w_n = (1 - \lambda)\lambda^{n-1}$, so the weights sum to 1.

$$\begin{aligned} \hat{A}^\pi_{GAE}(s_t,a_t) &= \sum_{n=1}^\infty w_n \hat{A}^\pi_n(s_t,a_t) \\ &= r(s_t,a_t) + \gamma\Big((1-\lambda)\hat{V}^\pi_\phi(s_{t+1}) + \lambda\big(r(s_{t+1},a_{t+1}) + \gamma((1-\lambda)\hat{V}^\pi_\phi(s_{t+2}) + \lambda r(s_{t+2},a_{t+2}) + \cdots)\big)\Big) \\ &= \sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\big(r(s_{t'},a_{t'}) + \gamma\hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})\big) \\ &= \sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\,\delta_{t'} \end{aligned}$$

where $\delta_{t'} = r(s_{t'},a_{t'}) + \gamma\hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$ is the TD error. Setting $\lambda = 0$ recovers the one-step critic estimator, and $\lambda \to 1$ recovers the Monte Carlo estimator.
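The last line suggests a one-pass implementation via the backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$. A minimal sketch, again under the assumed $(T,)$ / $(T+1,)$ array convention:

```python
import numpy as np

def gae_advantage(rewards, values, gamma=0.99, lam=0.95):
    """GAE via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    rewards = np.asarray(rewards, dtype=np.float64)  # shape (T,)
    values = np.asarray(values, dtype=np.float64)    # shape (T+1,)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD errors delta_t
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Consistent with the limits above, `lam=0` reproduces the one-step estimator, while `lam=1` telescopes into the Monte Carlo estimator (with a bootstrapped tail at the end of a finite rollout).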
