From the last chapter, the goal of RL is

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right],$$

and we denote the objective by $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right].$$
Evaluating the objective
We can use the Monte Carlo method to estimate $J(\theta)$ from samples:

$$J(\theta) \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t}),$$

where each sample $i$ is a trajectory over time $t$ generated by running $\pi_\theta$.
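As a minimal sketch (assuming a Gymnasium-style environment and a hypothetical `policy` function that samples an action from $\pi_\theta$ given a state), the estimate is just the average total reward over $N$ rollouts:

```python
import numpy as np

def estimate_objective(env, policy, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta): average total reward over sampled trajectories."""
    returns = []
    for _ in range(num_trajectories):
        state, _ = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = policy(state)                          # sample a_t ~ pi_theta(a_t | s_t)
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward                          # accumulate sum_t r(s_t, a_t)
            if terminated or truncated:
                break
        returns.append(total_reward)
    return np.mean(returns)                                 # (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})
```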
Direct policy gradient
To improve the objective $J(\theta)$, we can take its derivative and apply gradient ascent. Denote $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$, so that

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau.$$

The derivative of $J(\theta)$ with respect to $\theta$ is

$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right].$$

The second equality uses a convenient identity (the log-derivative trick):

$$\pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau).$$

From the above, $J(\theta)$ is the expectation of $r(\tau)$, and $\nabla_\theta J(\theta)$ is the expectation of $r(\tau)$ weighted by $\nabla_\theta \log \pi_\theta(\tau)$, where $\tau$ follows $\pi_\theta(\tau)$.
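As a quick illustration of the log-derivative trick (an illustrative toy example, not from the source), we can estimate $\nabla_\theta \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$ with the score-function estimator $\mathbb{E}[x^2\, \nabla_\theta \log p_\theta(x)]$ and compare it with the analytic answer $2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
num_samples = 200_000

# Sample x ~ N(theta, 1); the "reward" is f(x) = x^2.
x = rng.normal(loc=theta, scale=1.0, size=num_samples)
f = x ** 2

# Score function: grad_theta log N(x | theta, 1) = (x - theta)
score = x - theta

# Score-function (log-derivative) gradient estimate: E[f(x) * score]
grad_estimate = np.mean(f * score)

# Analytic gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2 * theta
print(grad_estimate, 2 * theta)   # the two values should be close
```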
We already know $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$, but what is $\nabla_\theta \log \pi_\theta(\tau)$?

$$\pi_\theta(\tau) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

$$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t=1}^T \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$$

$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[\log p(s_1) + \sum_{t=1}^T \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]\right] = \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The last equality holds because the initial-state and transition terms do not depend on $\theta$. It is worth mentioning that the log turns the product $\prod$ into a sum $\sum$, which is friendly to $\nabla$ and easy to estimate.
Finally,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$$

Notice that this expression requires neither the transition probabilities nor the initial-state distribution: we can simply sample from the environment without knowing the dynamics of the system. Moreover, the policy $\pi_\theta$ can be any parameterized distribution we choose.
Evaluating the policy gradient
In practice, we use $N$ sampled trajectories and take the average:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\right]$$

This naturally leads to the following algorithm.
REINFORCE algorithm:

repeat until convergence:

1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$ (run the current policy).
2. Compute $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_{t=1}^T r(s_t^i, a_t^i)\right)\right]$
3. Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
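As a minimal sketch of steps 2 and 3 in an autodiff framework (assuming a PyTorch policy network with discrete actions; the names `policy_net`, `optimizer`, and the trajectory format are hypothetical), we can build a surrogate loss whose gradient matches the estimator above:

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, trajectories):
    """One REINFORCE update from a batch of sampled trajectories.

    trajectories: list of dicts with keys
        'states'  -> tensor of shape (T, state_dim)
        'actions' -> tensor of shape (T,), long
        'rewards' -> tensor of shape (T,)
    """
    losses = []
    for traj in trajectories:
        logits = policy_net(traj['states'])                               # (T, num_actions)
        log_probs = Categorical(logits=logits).log_prob(traj['actions'])  # log pi(a_t | s_t)
        total_reward = traj['rewards'].sum()                              # sum_t r(s_t, a_t)
        # Negative sign: minimizing this surrogate performs gradient ascent on J(theta).
        losses.append(-(log_probs.sum() * total_reward))
    loss = torch.stack(losses).mean()                                     # average over N trajectories

    optimizer.zero_grad()
    loss.backward()    # autodiff yields (1/N) sum_i (sum_t grad log pi)(sum_t r)
    optimizer.step()   # for SGD this is theta <- theta + alpha * grad J(theta)
    return loss.item()
```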
Continuous actions: Gaussian policies
For continuous actions, we can use a Gaussian policy:

$$\pi_\theta(a_t \mid s_t) = \mathcal{N}\big(f_{\text{neural network}}(s_t);\, \Sigma\big)$$

$$\log \pi_\theta(a_t \mid s_t) = -\frac{1}{2}\,\big\|f(s_t) - a_t\big\|_\Sigma^2 + \text{const}$$

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t) = -\Sigma^{-1}\big(f(s_t) - a_t\big)\,\frac{df}{d\theta}$$
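As a minimal sketch (assuming a small PyTorch network `f` for the mean and a fixed covariance $\Sigma$, both hypothetical choices for illustration), computing $\log \pi_\theta(a_t \mid s_t)$ and its gradient looks like this:

```python
import torch
from torch.distributions import MultivariateNormal

# Gaussian policy: mean f(s_t) from a small network, fixed covariance Sigma.
state_dim, action_dim = 4, 2
f = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, action_dim))
Sigma = 0.1 * torch.eye(action_dim)          # fixed covariance matrix

def log_prob(state, action):
    """log pi_theta(a_t | s_t) = log N(a_t; f(s_t), Sigma)."""
    mean = f(state)
    return MultivariateNormal(mean, covariance_matrix=Sigma).log_prob(action)

# Sample an action and get grad_theta log pi for one (s_t, a_t) pair:
state = torch.randn(state_dim)
action = MultivariateNormal(f(state), covariance_matrix=Sigma).sample()
log_prob(state, action).backward()           # fills .grad of f's parameters with grad_theta log pi
```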
Partial observability

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\right]$$

Notice that the Markov property is never actually used in the derivation, so we can apply the policy gradient to partially observed MDPs without modification: simply condition the policy on observations $o_{i,t}$ instead of states.