Infinite cases
So far we have only discussed episodic tasks, but what about continuous/cyclical tasks? What if $T$ is $\infty$? In many cases, $\hat{V}^\pi_\phi$ can become infinitely large. A simple trick solves this problem: make it better to get rewards sooner rather than later.
One way to think of it: the agent survives each step only with probability $\gamma$, i.e. it has a $1-\gamma$ chance of "dying" at every step.
So the new target is:
$$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$$

with discount factor $\gamma \in [0,1]$ ($\gamma = 0.99$ works well in practice).
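As a minimal sketch of how this target is computed for a batch of sampled transitions (the function name `bootstrapped_targets` and the `dones` mask are illustrative assumptions, not from the notes above):

```python
import numpy as np

def bootstrapped_targets(rewards, next_values, dones, gamma=0.99):
    """Compute y_t = r(s_t, a_t) + gamma * V_hat(s_{t+1}) for a batch of transitions.

    rewards:     r(s_t, a_t) for each sampled transition
    next_values: critic estimates V_hat(s_{t+1})
    dones:       1.0 where the episode actually terminated at step t (no bootstrap), else 0.0
    gamma:       discount factor in [0, 1]
    """
    return rewards + gamma * (1.0 - dones) * next_values
```

Masking out the bootstrap term at true terminal states is a practical detail not mentioned above; without it, the critic would propagate value past the end of an episode.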
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(r(s_{i,t},a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})\right) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{A}^\pi_\phi(s_{i,t},a_{i,t})$$

Discount factors for policy gradient
In Monte Carlo policy gradients, we have 2 options:
option 1:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'},a_{i,t'})\right)$$

option 2:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t=1}^{T} \gamma^{t-1}\, r(s_{i,t},a_{i,t})\right)$$

Consider causality (applied to option 2):
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} \gamma^{t'-1}\, r(s_{i,t'},a_{i,t'})\right) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \gamma^{t-1}\,\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'},a_{i,t'})\right)$$

Option 1 only changes the "reward to go" and discounts rewards relative to the current step. Option 2, once causality is applied, keeps an extra factor $\gamma^{t-1}$ in front of every gradient term, so the whole episode is discounted: later actions contribute less simply because they happen later. Option 2 is the true objective when the robot really can die with probability $1-\gamma$ at every step, but option 1 is what we actually choose.
This is because the reason we use a discount factor at all is to fix the infinity problem in the continuous case, whereas the death model (option 2) only cares about the early steps of the episode. What we really want to approximate is the average reward without discount; the discount is better understood as saying that future rewards are more uncertain, so their influence should be reduced gradually, which is exactly what option 1 does.
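A small sketch contrasting the per-step weights produced by the two estimators (function names are illustrative; both assume a single sampled trajectory stored as a 1-D reward array):

```python
import numpy as np

def weights_option1(rewards, gamma=0.99):
    """Option 1: discounted reward-to-go, sum_{t'>=t} gamma^(t'-t) * r_t'.

    Discounting restarts at the current step, so late actions are not
    down-weighted just for being late.
    """
    w = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        w[t] = running
    return w

def weights_option2(rewards, gamma=0.99):
    """Option 2 (after applying causality): gamma^(t-1) * sum_{t'>=t} gamma^(t'-t) * r_t'.

    The extra gamma^(t-1) factor (gamma**t for 0-indexed arrays) shrinks the
    gradient contribution of every late time step, matching the "death" model.
    """
    return gamma ** np.arange(len(rewards)) * weights_option1(rewards, gamma)
```

For a long trajectory with roughly constant rewards, `weights_option2` decays toward zero while `weights_option1` stays roughly constant until near the end, which is why option 2 effectively ignores the later part of the episode.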
Actor-critic algorithms (with discount)
batch version
batch actor-critic algorithm:
repeat until convergence:
====1: sample $\{s_i, a_i\}$ from $\pi_\theta(a\mid s)$ (run it on the robot)
====2: fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (use the bootstrapped target $r + \gamma\hat{V}^\pi_\phi(s')$)
====3: evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma\hat{V}^\pi_\phi(s_i') - \hat{V}^\pi_\phi(s_i)$
====4: $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log\pi_\theta(a_i\mid s_i)\,\hat{A}^\pi(s_i, a_i)$
====5: $\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)$
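The loop above translates fairly directly into code. Below is a minimal PyTorch sketch of a single iteration, assuming `policy_net(states)` returns a `torch.distributions` object, `value_net` returns scalar value estimates, and `sample_trajectories` runs the current policy and returns batched tensors (all of these names and signatures are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def batch_actor_critic_step(policy_net, value_net, policy_opt, value_opt,
                            sample_trajectories, gamma=0.99, value_steps=20):
    # 1: sample {s_i, a_i} from pi_theta(a|s) by running the policy (e.g. on the robot)
    states, actions, rewards, next_states, dones = sample_trajectories(policy_net)

    # 2: fit V_hat_phi(s) to the bootstrapped targets r + gamma * V_hat_phi(s')
    for _ in range(value_steps):
        with torch.no_grad():
            targets = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)
        value_loss = F.mse_loss(value_net(states).squeeze(-1), targets)
        value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # 3: evaluate A_hat(s_i, a_i) = r + gamma * V_hat(s') - V_hat(s)
    with torch.no_grad():
        advantages = (rewards
                      + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)
                      - value_net(states).squeeze(-1))

    # 4: grad_theta J(theta) ~= sum_i grad_theta log pi_theta(a_i|s_i) * A_hat(s_i, a_i)
    log_probs = policy_net(states).log_prob(actions)
    policy_loss = -(log_probs * advantages).sum()

    # 5: theta <- theta + alpha * grad_theta J(theta)  (ascent via minimizing -J)
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```

The sign flip in step 4/5 is only because PyTorch optimizers minimize; the optimizer's learning rate plays the role of $\alpha$.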
online version
online actor-critic algorithm:
repeat until convergence:
====1: take action $a \sim \pi_\theta(a\mid s)$, get $(s, a, s', r)$
====2: update $\hat{V}^\pi_\phi$ using the target $r + \gamma\hat{V}^\pi_\phi(s')$
====3: evaluate $\hat{A}^\pi(s, a) = r(s, a) + \gamma\hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)$
====4: $\nabla_\theta J(\theta) \approx \nabla_\theta \log\pi_\theta(a\mid s)\,\hat{A}^\pi(s, a)$
====5: $\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)$
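And a corresponding PyTorch sketch of one online update, following the five steps above. `policy_net` is again assumed to return a `torch.distributions` object, `value_net` a scalar estimate, and the environment is assumed to use the classic Gym `step` signature with a discrete action space (all illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def online_actor_critic_step(env, state, policy_net, value_net,
                             policy_opt, value_opt, gamma=0.99):
    # 1: take action a ~ pi_theta(a|s), get (s, a, s', r)
    dist = policy_net(state)
    action = dist.sample()
    next_state, reward, done, _ = env.step(action.item())
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # 2: update V_hat_phi using the target r + gamma * V_hat_phi(s')
    with torch.no_grad():
        target = reward + gamma * (0.0 if done else value_net(next_state).squeeze())
        target = torch.as_tensor(target, dtype=torch.float32)
    value_loss = F.mse_loss(value_net(state).squeeze(), target)
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # 3: evaluate A_hat(s, a) = r + gamma * V_hat(s') - V_hat(s)
    with torch.no_grad():
        advantage = target - value_net(state).squeeze()

    # 4 & 5: single-sample policy gradient step (ascent via minimizing the negative)
    policy_loss = -dist.log_prob(action) * advantage
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    return next_state, done
```

A single-sample update like this has high variance; in practice several such updates are usually collected (e.g. with parallel workers) before applying them.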