Eligibility traces & n-step returns
Bootstrapping with the critic $\hat{V}^\pi_\phi$ introduces bias, while summing sampled rewards introduces variance.
Critic
$$\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$
+: lower variance
-: higher bias if value is wrong (it always is)
Monte Carlo
$$\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t}\, r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$$
+: no bias
-: higher variance (because single-sample estimate)
Can we combine these two to control the bias/variance tradeoff?
Rewards far in the future are suppressed by the discount factor $\gamma$, so we can cut the sum early and bootstrap with the critic from there.
$$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n} \gamma^{t'-t}\, r(s_{t'}, a_{t'}) + \gamma^n \hat{V}^\pi_\phi(s_{t+n}) - \hat{V}^\pi_\phi(s_t)$$
Choosing $n > 1$ often works better.
Generalized advantage estimate
Do we have to choose just one $n$? We can cut at every step, all at once.
Use a weighted combination of $n$-step returns:
$$w_n \propto \lambda^{n-1}$$
$$\begin{aligned}
\hat{A}^\pi_{GAE}(s_t, a_t) &= \sum_{n=1}^{\infty} w_n\, \hat{A}^\pi_n(s_t, a_t) \\
&= r(s_t, a_t) + \gamma\Big( (1-\lambda)\,\hat{V}^\pi_\phi(s_{t+1}) + \lambda\big( r(s_{t+1}, a_{t+1}) + \gamma\big( (1-\lambda)\,\hat{V}^\pi_\phi(s_{t+2}) + \lambda\, r(s_{t+2}, a_{t+2}) + \cdots \big) \big) \Big) \\
&= \sum_{t'=t}^{\infty} (\gamma\lambda)^{t'-t} \big( r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'}) \big) \\
&= \sum_{t'=t}^{\infty} (\gamma\lambda)^{t'-t}\, \delta_{t'}
\end{aligned}$$
where $\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$ is the TD error at step $t'$.