Critics as state-dependent baselines
Actor-critic
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\left(r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$

+: lower variance (due to critic)
-: biased (if the critic is not perfect)
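A minimal NumPy sketch of this bootstrapped advantage estimate, assuming we already have a fitted critic evaluated along a sampled trajectory; the function name, array shapes, and numbers are illustrative assumptions, not from the source.

```python
import numpy as np

def bootstrapped_advantages(rewards, values, next_values, gamma=0.99):
    """One-step actor-critic estimate: r(s_t, a_t) + gamma * V_hat(s_{t+1}) - V_hat(s_t).

    rewards:     (T,) rewards along one sampled trajectory
    values:      (T,) critic values V_hat(s_t)
    next_values: (T,) critic values V_hat(s_{t+1}), 0 at terminal states
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    # Lower variance than a Monte Carlo return, but biased whenever V_hat is imperfect.
    return rewards + gamma * next_values - values

# Illustrative usage with made-up numbers:
adv = bootstrapped_advantages(
    rewards=[1.0, 0.0, 2.0],
    values=[0.5, 0.7, 1.1],
    next_values=[0.7, 1.1, 0.0],  # last state is terminal
)
```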
Policy gradient
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) - b\right)$$

+: no bias
-: higher variance (because it is a single-sample estimate)
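For contrast, a short sketch of the Monte Carlo estimate above: discounted reward-to-go per timestep minus a constant baseline (here simply the mean return). All names and numbers are illustrative assumptions, not from the source.

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: sum_{t'=t}^T gamma^(t'-t) * r_{t'}."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = reward_to_go([1.0, 0.0, 2.0])
b = rtg.mean()        # any constant baseline keeps the estimate unbiased
weights = rtg - b     # multiplies grad log pi(a_t | s_t) in the policy gradient
```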
Can we use $\hat{V}^\pi_\phi$ and still keep the estimate unbiased?
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$

+: no bias
+: lower variance (baseline is closer to rewards)
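The "no bias" claim rests on a standard argument: a baseline that depends only on the state contributes nothing to the gradient in expectation (using $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$):

$$\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{V}^\pi_\phi(s_t)\right] = \hat{V}^\pi_\phi(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t = \hat{V}^\pi_\phi(s_t)\, \nabla_\theta \!\int \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t = \hat{V}^\pi_\phi(s_t)\, \nabla_\theta 1 = 0,$$

so the critic here only reduces variance and never shifts the expected gradient.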
Control variates: action-dependent baselines
In theory
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Option 1:
$$\hat{A}^\pi(s_t, a_t) = \sum_{t'=t}^\infty \gamma^{t'-t} r(s_{t'}, a_{t'}) - V^\pi_\phi(s_t)$$

+: no bias
-: higher variance (because it is a single-sample estimate)
Option 2:
$$\hat{A}^\pi(s_t, a_t) = \sum_{t'=t}^\infty \gamma^{t'-t} r(s_{t'}, a_{t'}) - Q^\pi_\phi(s_t, a_t)$$

+: goes to zero in expectation if the critic is correct
-: not correct on its own (the action-dependent baseline biases the gradient; see below)
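Why Option 2 is not correct on its own: with an action-dependent baseline, the subtracted term no longer vanishes in expectation, because the quantity left inside the integral now depends on the action:

$$\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi_\phi(s_t, a_t)\right] = \nabla_\theta\, \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[Q^\pi_\phi(s_t, a_t)\right] \neq 0 \text{ in general.}$$

This expectation is exactly the term that has to be added back, which gives the combined estimator below.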
Combination:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\left(\hat{Q}_{i,t} - Q^\pi_\phi(s_{i,t}, a_{i,t})\right) + \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta\, \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_{i,t})}\!\left[Q^\pi_\phi(s_{i,t}, a_t)\right]$$
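A minimal sketch of how the two terms could be assembled, assuming a discrete action space so the second (expected-$Q_\phi$) term can be computed exactly by summing over actions; the surrogate below is a PyTorch objective whose gradient matches the combined estimator for a single trajectory, and all names (`logits`, `q_mc`, `q_phi`) are illustrative assumptions, not from the source.

```python
import torch

def combined_pg_surrogate(logits, actions, q_mc, q_phi):
    """Surrogate whose gradient w.r.t. the policy parameters is
    E[grad log pi * (Q_hat - Q_phi)] + grad E_{a~pi}[Q_phi(s, a)].

    logits:  (T, A) policy logits at each visited state (built from the policy parameters)
    actions: (T,)   sampled actions a_{i,t} (int64)
    q_mc:    (T,)   Monte Carlo return estimates Q_hat_{i,t}
    q_phi:   (T, A) critic values Q_phi(s_{i,t}, a) for every action a
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(actions)                            # log pi(a_t | s_t)
    q_phi_taken = q_phi.gather(1, actions.unsqueeze(1)).squeeze(1)

    # First term: score function weighted by (Q_hat - Q_phi); the weight is
    # detached so the gradient only flows through log pi.
    term1 = (log_prob * (q_mc - q_phi_taken).detach()).mean()

    # Second term: E_{a ~ pi}[Q_phi(s, a)], exact for discrete actions; the
    # gradient flows through the policy probabilities, not the critic.
    term2 = (dist.probs * q_phi.detach()).sum(dim=1).mean()

    # Averaged over timesteps for readability (the formula sums over t and
    # averages over trajectories; this only rescales the gradient).
    return term1 + term2
```

For continuous actions, the second term is typically handled by evaluating the expectation analytically when the policy and critic allow it (e.g. a Gaussian policy with a critic that is linear or quadratic in the action) or by sampling fresh actions from the policy; keeping that term cheap is what makes the action-dependent baseline practical.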