Other advantages

Eligibility traces & n-step returns

The learned value function $\hat{V}^\pi_\phi$ brings bias, while the Monte Carlo reward-sum brings variance.

Critic

$$\hat{A}^\pi_{C}=r(s_{t},a_{t})+\gamma \hat{V}^\pi_\phi(s_{t+1})-\hat{V}^\pi_\phi(s_{t})$$

+: lower variance

-: higher bias if value is wrong (it always is)

Monte Carlo

$$\hat{A}^\pi_{MC}=\sum_{t'=t}^\infty\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}^\pi_\phi(s_{t})$$

+: no bias

-: higher variance (because single-sample estimate)
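
To make the bias/variance comparison concrete, here is a minimal NumPy sketch (mine, not from the course) of the two estimators on a single finite trajectory. The `rewards`/`values` arrays, the function names, and the choice of a zero value after the final step are illustrative assumptions.

```python
import numpy as np

def critic_advantage(rewards, values, gamma):
    # one-step estimate: A_C(s_t,a_t) = r_t + gamma * V(s_{t+1}) - V(s_t);
    # the value after the final step is taken to be 0 (assumed episode end)
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values

def monte_carlo_advantage(rewards, values, gamma):
    # A_MC(s_t,a_t) = sum_{t'>=t} gamma^(t'-t) * r_{t'} - V(s_t)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # discounted reward-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values

# toy trajectory: one reward and one value estimate per time step
rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0])
print(critic_advantage(rewards, values, gamma=0.99))
print(monte_carlo_advantage(rewards, values, gamma=0.99))
```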

Can we combine these two to control the bias/variance tradeoff?

The contribution of future rewards decays with the discount factor $\gamma$, so we can cut the reward sum early and let the critic fill in the tail:

$$\hat{A}^\pi_n(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n \hat{V}^\pi_\phi(s_{t+n})-\hat{V}^\pi_\phi(s_{t})$$

Choosing $n>1$ often works better.
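
A sketch of the n-step estimator under the same illustrative assumptions as above. The reward sum runs over $n$ steps and the critic then takes over through the bootstrap term $\gamma^n\hat{V}^\pi_\phi(s_{t+n})$; $n=1$ reduces to the critic estimate, while large $n$ approaches the Monte Carlo estimate.

```python
import numpy as np

def n_step_advantage(rewards, values, gamma, n):
    # A_n(s_t,a_t) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t);
    # the sum is truncated at the end of the trajectory, and the value after the
    # final step is taken to be 0
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)
        ret = sum(gamma**k * rewards[t + k] for k in range(horizon))
        if t + n < T:                     # bootstrap with the critic if s_{t+n} was visited
            ret += gamma**n * values[t + n]
        adv[t] = ret - values[t]
    return adv

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0])
print(n_step_advantage(rewards, values, gamma=0.99, n=2))  # n=1 recovers the critic estimate
```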

Generalized advantage estimate

Do we have to choose just one $n$? We can cut everywhere, all at once.

Use a weighted combination of n-step returns, with weights that fall off exponentially in $n$:

$$w_n\propto \lambda^{n-1}$$

$$\begin{aligned}
\hat{A}^\pi_{GAE}(s_t,a_t)&=\sum_{n=1}^\infty w_n \hat{A}^\pi_n(s_t,a_t) \\
&=r(s_t,a_t)+\gamma\Big((1-\lambda)\hat{V}^\pi_\phi(s_{t+1})+\lambda\big(r(s_{t+1},a_{t+1})+\gamma\big((1-\lambda)\hat{V}^\pi_\phi(s_{t+2})+\lambda r(s_{t+2},a_{t+2})+\cdots\big)\big)\Big) \\
&=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\big(r(s_{t'},a_{t'})+\gamma\hat{V}^\pi_\phi(s_{t'+1})-\hat{V}^\pi_\phi(s_{t'})\big) \\
&=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}
\end{aligned}$$

where $\delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}^\pi_\phi(s_{t'+1})-\hat{V}^\pi_\phi(s_{t'})$ is the TD error. The product $\gamma\lambda$ acts like a discount: $\lambda\to 0$ recovers the low-variance critic estimate, $\lambda\to 1$ the unbiased Monte Carlo estimate.
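
The last line suggests computing the estimate with the backward recursion $\hat{A}_t=\delta_t+\gamma\lambda\hat{A}_{t+1}$. Below is a minimal NumPy sketch of that recursion, again under the illustrative assumption of a finite trajectory with zero value after the final step.

```python
import numpy as np

def gae_advantage(rewards, values, gamma, lam):
    # A_GAE(s_t,a_t) = sum_{t'>=t} (gamma*lam)^(t'-t) * delta_{t'},
    # computed with the backward recursion A_t = delta_t + gamma*lam*A_{t+1};
    # lam -> 0 gives the one-step critic estimate, lam -> 1 the Monte Carlo estimate
    next_values = np.append(values[1:], 0.0)           # value after the final step taken as 0
    deltas = rewards + gamma * next_values - values    # TD errors delta_t
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0])
print(gae_advantage(rewards, values, gamma=0.99, lam=0.95))
```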
[Figures: discount reward; n-step returns]