From the last chapter, the goal of RL is

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right],$$

and we denote the objective by $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right].$$
Evaluating the objective
We can use the Monte Carlo method to estimate $J(\theta)$ from samples:

$$J(\theta) \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t}),$$

where each sample $i$ is a trajectory over time $t$ generated by running $\pi_\theta$.
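As a minimal sketch (assuming a Gymnasium-style environment and a hypothetical `policy` function that samples an action from $\pi_\theta$ given a state), the estimate is just the average total reward over $N$ rollouts:

```python
import numpy as np

def estimate_objective(env, policy, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta): average total reward over sampled trajectories."""
    returns = []
    for _ in range(num_trajectories):
        state, _ = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = policy(state)                          # sample a_t ~ pi_theta(a_t | s_t)
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward                          # accumulate sum_t r(s_t, a_t)
            if terminated or truncated:
                break
        returns.append(total_reward)
    return np.mean(returns)                                 # (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})
```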
Direct policy gradient
To improve the objective $J(\theta)$, we can take its derivative and apply gradient ascent. Denote $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$, so that

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau.$$

The derivative of $J(\theta)$ with respect to $\theta$ is

$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right].$$

The second equality uses a convenient identity (the log-derivative trick):

$$\pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau).$$

From the above, $J(\theta)$ is the expectation of $r(\tau)$, and $\nabla_\theta J(\theta)$ is the expectation of $r(\tau)$ weighted by $\nabla_\theta \log \pi_\theta(\tau)$, where $\tau$ follows $\pi_\theta(\tau)$.
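As a quick illustration of the log-derivative trick (an illustrative toy example, not from the source), we can estimate $\nabla_\theta \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$ with the score-function estimator $\mathbb{E}[x^2\, \nabla_\theta \log p_\theta(x)]$ and compare it with the analytic answer $2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
num_samples = 200_000

# Sample x ~ N(theta, 1); the "reward" is f(x) = x^2.
x = rng.normal(loc=theta, scale=1.0, size=num_samples)
f = x ** 2

# Score function: grad_theta log N(x | theta, 1) = (x - theta)
score = x - theta

# Score-function (log-derivative) gradient estimate: E[f(x) * score]
grad_estimate = np.mean(f * score)

# Analytic gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2 * theta
print(grad_estimate, 2 * theta)   # the two values should be close
```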
We already know $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$, but what is $\nabla_\theta \log \pi_\theta(\tau)$?

$$\pi_\theta(\tau) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

$$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t=1}^T \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$$

$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[\log p(s_1) + \sum_{t=1}^T \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]\right] = \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The last equality holds because the initial-state and transition terms do not depend on $\theta$. It is worth mentioning that the log turns the product $\prod$ into a sum $\sum$, which is friendly to $\nabla$ and easy to estimate.
Finally,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$$

Notice that this expression requires neither the transition probabilities nor the initial-state distribution: we can simply sample from the environment without knowing the dynamics of the system. Moreover, the policy $\pi_\theta$ can be any parameterized distribution we choose.
Evaluating the policy gradient
In practice, we use $N$ sampled trajectories and take the average:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\right]$$

This naturally leads to the following algorithm.
REINFORCE algorithm:

repeat until convergence:

1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$ (run the current policy).
2. Compute $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_{t=1}^T r(s_t^i, a_t^i)\right)\right]$
3. Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
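As a minimal sketch of steps 2 and 3 in an autodiff framework (assuming a PyTorch policy network with discrete actions; the names `policy_net`, `optimizer`, and the trajectory format are hypothetical), we can build a surrogate loss whose gradient matches the estimator above:

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, trajectories):
    """One REINFORCE update from a batch of sampled trajectories.

    trajectories: list of dicts with keys
        'states'  -> tensor of shape (T, state_dim)
        'actions' -> tensor of shape (T,), long
        'rewards' -> tensor of shape (T,)
    """
    losses = []
    for traj in trajectories:
        logits = policy_net(traj['states'])                               # (T, num_actions)
        log_probs = Categorical(logits=logits).log_prob(traj['actions'])  # log pi(a_t | s_t)
        total_reward = traj['rewards'].sum()                              # sum_t r(s_t, a_t)
        # Negative sign: minimizing this surrogate performs gradient ascent on J(theta).
        losses.append(-(log_probs.sum() * total_reward))
    loss = torch.stack(losses).mean()                                     # average over N trajectories

    optimizer.zero_grad()
    loss.backward()    # autodiff yields (1/N) sum_i (sum_t grad log pi)(sum_t r)
    optimizer.step()   # for SGD this is theta <- theta + alpha * grad J(theta)
    return loss.item()
```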
Continuous actions: Gaussian policies
For continuous actions, we can use a Gaussian policy:

$$\pi_\theta(a_t \mid s_t) = \mathcal{N}\big(f_{\text{neural network}}(s_t);\, \Sigma\big)$$

$$\log \pi_\theta(a_t \mid s_t) = -\frac{1}{2}\,\big\|f(s_t) - a_t\big\|_\Sigma^2 + \text{const}$$

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t) = -\Sigma^{-1}\big(f(s_t) - a_t\big)\,\frac{df}{d\theta}$$
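As a minimal sketch (assuming a small PyTorch network `f` for the mean and a fixed covariance $\Sigma$, both hypothetical choices for illustration), computing $\log \pi_\theta(a_t \mid s_t)$ and its gradient looks like this:

```python
import torch
from torch.distributions import MultivariateNormal

# Gaussian policy: mean f(s_t) from a small network, fixed covariance Sigma.
state_dim, action_dim = 4, 2
f = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, action_dim))
Sigma = 0.1 * torch.eye(action_dim)          # fixed covariance matrix

def log_prob(state, action):
    """log pi_theta(a_t | s_t) = log N(a_t; f(s_t), Sigma)."""
    mean = f(state)
    return MultivariateNormal(mean, covariance_matrix=Sigma).log_prob(action)

# Sample an action and get grad_theta log pi for one (s_t, a_t) pair:
state = torch.randn(state_dim)
action = MultivariateNormal(f(state), covariance_matrix=Sigma).sample()
log_prob(state, action).backward()           # fills .grad of f's parameters with grad_theta log pi
```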
Partial observability

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\right]$$

Notice that the Markov property is never actually used in the derivation, so we can apply the policy gradient to partially observed MDPs without modification: simply condition the policy on observations $o_{i,t}$ instead of states.