Fit what?
So now we need to fit one of $Q^\pi$, $V^\pi$, or $A^\pi$. The question is: fit what to what? Recall:
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big]$$
$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[Q^\pi(s_t, a_t)\big]$$
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
Firstly, both $Q^\pi$ and $A^\pi$ can be calculated or approximated from $V^\pi$:
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big] = r(s_t, a_t) + \sum_{t'=t+1}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big] \approx r(s_t, a_t) + V^\pi(s_{t+1})$$
$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$
Apart from this, both $Q^\pi$ and $A^\pi$ need two inputs $(s_t, a_t)$, but $V^\pi$ only needs $s_t$, which may be easier to fit.
So let's just fit $V^\pi$.
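To make the "only needs $s_t$" point concrete, here is a minimal PyTorch sketch (the network sizes and the small MLP architecture are my own illustrative assumptions, not part of the notes): a $\hat{V}^\pi_\phi$ network consumes only the state, while a $\hat{Q}^\pi$ network would need the state and action together.

```python
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 8, 2, 64    # hypothetical sizes, not from the notes

# V^pi only needs the state s_t as input ...
value_net = nn.Sequential(
    nn.Linear(state_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

# ... whereas Q^pi would need both s_t and a_t concatenated.
q_net = nn.Sequential(
    nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

s = torch.randn(32, state_dim)               # batch of states
a = torch.randn(32, action_dim)              # batch of (continuous) actions
v = value_net(s)                             # shape (32, 1)
q = q_net(torch.cat([s, a], dim=-1))         # shape (32, 1)
```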
Fitted to what?
$$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t\big]$$
Fitting $V^\pi(s_t)$ evaluates how good the policy is, which is called policy evaluation. As in policy gradient, we can use Monte Carlo policy evaluation:
$$V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$
If we are able to reset the simulator, we can use more than one sample:
$$V^\pi(s_t) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$
But the single-sample estimate is still pretty good, so the supervised learning setup is:
training data: $\{(s_{i,t}, \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}))\}$. Define the target $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.
supervised regression: $\mathcal{L}(\phi) = \frac{1}{2} \sum_i \big\| \hat{V}^\pi_\phi(s_i) - y_i \big\|^2$
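A minimal sketch of this Monte Carlo policy evaluation in PyTorch, reusing the hypothetical `value_net` above (the helper names, optimizer choice, and epoch count are my own illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def monte_carlo_targets(rewards):
    """Reward-to-go targets y_t = sum_{t'=t}^{T} r_{t'} for one sampled trajectory."""
    targets, running = [], 0.0
    for r in reversed(rewards):              # accumulate sums from the end of the episode
        running = r + running
        targets.append(running)
    return list(reversed(targets))

def fit_value_function(value_net, optimizer, states, targets, epochs=50):
    """Supervised regression: minimize 1/2 * sum_i ||V_phi(s_i) - y_i||^2."""
    y = torch.tensor(targets, dtype=torch.float32).unsqueeze(-1)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = 0.5 * F.mse_loss(value_net(states), y, reduction="sum")
        loss.backward()
        optimizer.step()
```

Here `states` is a float tensor of shape `(T, state_dim)` holding the visited states of one trajectory, e.g. `fit_value_function(value_net, torch.optim.Adam(value_net.parameters(), lr=1e-3), states, monte_carlo_targets(rewards))`.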
Bootstrapped estimate
The Monte Carlo target $y_{i,t}$ is not perfect. Can we do better? The ideal target is:
$$y_{i,t} = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{i,t'}, a_{i,t'}) \mid s_{i,t}\big] \approx r(s_{i,t}, a_{i,t}) + \sum_{t'=t+1}^{T} E_{\pi_\theta}\big[r(s_{i,t'}, a_{i,t'}) \mid s_{i,t+1}\big] \approx r(s_{i,t}, a_{i,t}) + V^\pi(s_{i,t+1}) \approx r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$$
Sample one step and then directly use the previously fitted value function.
This estimate has more bias but lower variance.
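A sketch of the bootstrapped target in code, assuming `value_net` and batched tensors of rewards and next states (the terminal-state mask is my own addition to handle episode boundaries, not something the notes discuss):

```python
import torch

def bootstrapped_targets(rewards, next_states, dones, value_net):
    """y_t ~ r(s_t, a_t) + V_phi(s_{t+1}), using the previously fitted value network."""
    with torch.no_grad():                        # targets are treated as constants
        next_values = value_net(next_states).squeeze(-1)
    # zero out V(s_{t+1}) at terminal steps (assumed convention, not from the notes)
    return rewards + (1.0 - dones) * next_values
```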
Algorithm
Batch actor-critic algorithm:
repeat until convergence:
====1: sample $\{s_i, a_i\}$ from $\pi_\theta(a \mid s)$ (run it on the robot)
====2: fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (or use the bootstrapped estimate as the target)
====3: evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \hat{V}^\pi_\phi(s_i') - \hat{V}^\pi_\phi(s_i)$
====4: $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \, \hat{A}^\pi(s_i, a_i)$
====5: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
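Putting the five steps together, here is a minimal sketch of the batch actor-critic loop, assuming a discrete-action Gymnasium-style `env`, an MLP `policy_net` producing action logits, and the `value_net` from above; all hyperparameters and the terminal masking are illustrative assumptions rather than part of the notes.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def batch_actor_critic(env, policy_net, value_net, iterations=200,
                       batch_episodes=10, lr=1e-3):
    pi_opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    v_opt = torch.optim.Adam(value_net.parameters(), lr=lr)

    for _ in range(iterations):
        states, actions, rewards, next_states, dones = [], [], [], [], []

        # 1: sample {s_i, a_i} from pi_theta(a|s) by running the policy
        for _ in range(batch_episodes):
            s, _ = env.reset()
            done = False
            while not done:
                logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
                a = Categorical(logits=logits).sample()
                s_next, r, terminated, truncated, _ = env.step(a.item())
                done = terminated or truncated
                states.append(s); actions.append(a.item()); rewards.append(r)
                next_states.append(s_next); dones.append(float(terminated))
                s = s_next

        s_t = torch.as_tensor(np.array(states), dtype=torch.float32)
        a_t = torch.as_tensor(actions)
        r_t = torch.as_tensor(rewards, dtype=torch.float32)
        s_tp1 = torch.as_tensor(np.array(next_states), dtype=torch.float32)
        d_t = torch.as_tensor(dones, dtype=torch.float32)

        # 2: fit V_phi to the bootstrapped targets y = r + V_phi(s')
        with torch.no_grad():
            y = r_t + (1.0 - d_t) * value_net(s_tp1).squeeze(-1)
        v_loss = 0.5 * F.mse_loss(value_net(s_t).squeeze(-1), y, reduction="sum")
        v_opt.zero_grad(); v_loss.backward(); v_opt.step()

        # 3: evaluate A(s_i, a_i) = r + V_phi(s') - V_phi(s) with the freshly fitted critic
        with torch.no_grad():
            adv = (r_t + (1.0 - d_t) * value_net(s_tp1).squeeze(-1)
                   - value_net(s_t).squeeze(-1))

        # 4 & 5: grad J ~ sum_i grad log pi(a_i|s_i) * A_i, then a gradient ascent step on J
        log_probs = Categorical(logits=policy_net(s_t)).log_prob(a_t)
        pi_loss = -(log_probs * adv).sum()       # minimizing -J ascends J
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Note that the advantage tensor is computed under `torch.no_grad()`, so the policy gradient step treats $\hat{A}^\pi$ as a fixed coefficient, matching step 4 above.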