Evaluate the PG

From the last chapter, the goal of RL is

$$\theta^\star=\arg\max_{\theta} E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$

and denote $J(\theta)$ as

$$J(\theta)=E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$

which is called the objective.

Evaluating the objective

We can use the Monte Carlo method to draw samples and estimate $J(\theta)$:

$$J(\theta)\approx \frac{1}{N}\sum_i\sum_t r(s_{i,t},a_{i,t})$$

Each sample (each $i$) is a trajectory over time $t$ drawn from $\pi_\theta$.
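For concreteness, here is a minimal Monte Carlo sketch of this estimate. It assumes a hypothetical Gym-style `env` (with `reset()` and `step(a)`) and a stochastic `policy(s)` that samples an action from $\pi_\theta(a|s)$; both names are placeholders, not part of the notes.

```python
import numpy as np

def estimate_objective(env, policy, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta): the average total reward over
    trajectories sampled by running the current policy."""
    returns = []
    for _ in range(num_trajectories):
        s = env.reset()                      # start a new trajectory (sample i)
        total_reward = 0.0
        for _ in range(horizon):             # roll out over time t
            a = policy(s)                    # a ~ pi_theta(a|s)
            s, r, done, _ = env.step(a)
            total_reward += r
            if done:
                break
        returns.append(total_reward)
    return float(np.mean(returns))           # (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})
```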

Direct policy gradient

We want to improve the objective $J(\theta)$, so we can take its derivative and use gradient ascent. Denote $r(\tau)=\sum_{t=1}^T r(s_t,a_t)$, so that

$$J(\theta)=E_{\tau\sim \pi_\theta (\tau)}[r(\tau)]=\int \pi_\theta (\tau)\,r(\tau)\,d\tau$$

So the derivative of $J(\theta)$ with respect to $\theta$ is

$$
\begin{aligned}
\nabla_\theta J(\theta) &=\int \nabla_\theta \pi_\theta(\tau)\,r(\tau)\,d\tau \\
&=\int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,r(\tau)\,d\tau \\
&=E_{\tau\sim \pi_\theta (\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\,r(\tau)\right]
\end{aligned}
$$

The second equality uses a convenient identity (the log-derivative trick):

$$\pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) =\pi_\theta(\tau)\,\frac{\nabla_\theta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\nabla_\theta\pi_\theta(\tau)$$
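As a quick sanity check, this identity can be verified numerically with automatic differentiation. A small PyTorch sketch on a toy softmax-parameterized categorical distribution (the parameterization is chosen only for illustration):

```python
import torch

# Verify pi(a) * grad_theta(log pi(a)) == grad_theta(pi(a)) for each action a.
theta = torch.randn(3, requires_grad=True)

for a in range(3):
    prob = torch.softmax(theta, dim=0)[a]
    log_prob = torch.log_softmax(theta, dim=0)[a]
    (grad_log_prob,) = torch.autograd.grad(log_prob, theta)
    (grad_prob,) = torch.autograd.grad(prob, theta)
    assert torch.allclose(prob.detach() * grad_log_prob, grad_prob, atol=1e-6)
```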

From the above, $J(\theta)$ is the expectation of $r(\tau)$, and $\nabla_\theta J(\theta)$ is the expectation of $\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)$, i.e. $r(\tau)$ weighted by $\nabla_\theta \log \pi_\theta(\tau)$, where $\tau$ follows $\pi_\theta(\tau)$.

We already know $r(\tau)=\sum_{t=1}^T r(s_t,a_t)$, but what is $\nabla_\theta \log \pi_\theta(\tau)$?

$$
\begin{aligned}
\pi_\theta(\tau)&= p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t) \\
\log \pi_\theta(\tau) &= \log p(s_1) +\sum_{t=1}^T \left[\log\pi_\theta (a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\right]\\
\nabla_\theta \log \pi_\theta(\tau) &= \nabla_\theta \left[\log p(s_1) +\sum_{t=1}^T \left[\log\pi_\theta (a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\right] \right] \\
&=\sum_{t=1}^T\nabla_\theta \log\pi_\theta (a_t|s_t)
\end{aligned}
$$

The last equality holds because the initial-state distribution and the transition terms do not depend on $\theta$. It is worth mentioning that taking the log turns the product $\prod$ into a sum $\sum$, which is friendly to $\nabla$ and easy to estimate.

Finally,

$$\nabla_\theta J(\theta) =E_{\tau\sim \pi_\theta (\tau)} \left[\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_t|s_t) \right) \left(\sum_{t=1}^T r(s_t,a_t) \right)\right]$$

Notice that this expression requires neither the transition probabilities nor the initial-state distribution: we can simply sample from the environment without knowing the dynamics of the system. Besides, the policy class $\pi_\theta$ can be chosen freely.

Evaluating the policy gradient

In practice, we can use $N$ trajectory samples and take the average:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t}) \right) \left(\sum_{t=1}^T r(s_{i,t},a_{i,t}) \right) \right]$$
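In an automatic-differentiation framework, one usually does not form $\nabla_\theta \log\pi_\theta$ explicitly; instead one builds a scalar "pseudo-loss" whose gradient equals (minus) the estimator above. A minimal PyTorch sketch, under the assumption that `log_probs[i]` holds the per-step log-probabilities $\log\pi_\theta(a_{i,t}|s_{i,t})$ produced by the policy network (all names are illustrative):

```python
import torch

def pg_pseudo_loss(log_probs, rewards):
    """Scalar whose .backward() gradient is minus the policy gradient estimator,
    so minimizing it with a standard optimizer performs gradient ascent on J.

    log_probs: list of N tensors of shape (T_i,), log pi_theta(a_{i,t}|s_{i,t}),
               produced by the policy network so they carry gradients.
    rewards:   list of N reward sequences r(s_{i,t}, a_{i,t}).
    """
    loss = torch.zeros(())
    for lp, r in zip(log_probs, rewards):
        trajectory_return = float(sum(r))   # sum_t r(s_{i,t}, a_{i,t}), a constant weight
        loss = loss - lp.sum() * trajectory_return
    return loss / len(log_probs)
```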

Naturally, we obtain the following algorithm.

REINFORCE algorithm:

repeat until convergence

====1: sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run the current policy)

====2: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{t}^i|s_{t}^i) \right) \left(\sum_{t=1}^T r(s_{t}^i,a_{t}^i) \right) \right]$

====3: $\theta\leftarrow \theta+ \alpha \nabla_\theta J(\theta)$
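Putting the three steps together, here is a minimal REINFORCE loop in PyTorch. The Gym-style `env` (with a discrete action space) and the `policy_net` mapping a state to action logits are assumptions made for this sketch:

```python
import torch

def reinforce(env, policy_net, num_iterations=500, trajectories_per_iter=10,
              horizon=200, lr=1e-2):
    """Minimal REINFORCE sketch following steps 1-3 above."""
    optimizer = torch.optim.SGD(policy_net.parameters(), lr=lr)
    for _ in range(num_iterations):                      # repeat until convergence
        pseudo_loss = torch.zeros(())
        for _ in range(trajectories_per_iter):           # step 1: sample {tau^i} from pi_theta
            s = env.reset()
            log_probs, rewards = [], []
            for _ in range(horizon):
                logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
                dist = torch.distributions.Categorical(logits=logits)
                a = dist.sample()
                log_probs.append(dist.log_prob(a))
                s, r, done, _ = env.step(a.item())
                rewards.append(r)
                if done:
                    break
            # step 2: this trajectory contributes (sum_t log pi)(sum_t r) to the pseudo-loss
            pseudo_loss = pseudo_loss - torch.stack(log_probs).sum() * sum(rewards)
        optimizer.zero_grad()
        (pseudo_loss / trajectories_per_iter).backward()
        optimizer.step()                                 # step 3: theta <- theta + alpha * grad J
```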

Continuous actions: Gaussian policies

For continuous actions, we can use a Gaussian policy:

$$
\begin{aligned}
\pi_\theta(a_t|s_t)&=\mathcal{N}\big(f_{\text{neural network}}(s_t),\,\Sigma\big) \\
\log \pi_\theta(a_t|s_t) &=-\frac{1}{2}\|f(s_t)-a_t\|^2_\Sigma +\text{const} \\
\nabla_{\theta}\log\pi_\theta(a_t|s_t) &=-\Sigma^{-1}\big(f(s_t)-a_t\big)\frac{df}{d\theta}
\end{aligned}
$$
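A minimal PyTorch sketch of such a Gaussian policy: a small network $f_\theta(s)$ outputs the mean, and the covariance is kept as a fixed diagonal here (sizes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = N(f_theta(s), Sigma) with a fixed diagonal Sigma."""
    def __init__(self, state_dim, action_dim, sigma=0.5):
        super().__init__()
        # f_theta: a small MLP that maps the state to the mean action
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim)
        )
        self.sigma = sigma

    def forward(self, state):
        return torch.distributions.Normal(self.mean_net(state), self.sigma)

# Sampling an action and its log-probability; log_prob is differentiable with
# respect to the network parameters, which is all the policy gradient needs.
policy = GaussianPolicy(state_dim=4, action_dim=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action).sum()   # sum over independent action dimensions
```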

Partial observability

When only observations are available, the policy conditions on $o_{i,t}$ instead of $s_{i,t}$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|o_{i,t}) \right) \left(\sum_{t=1}^T r(s_{i,t},a_{i,t}) \right) \right]$$

Notice that the Markov property is not actually used in the derivation, so we can apply the policy gradient to partially observed MDPs without modification.
