# Evaluating the Policy Gradient

From the last chapter, the goal of RL is

$$
\theta^\star=\arg\max_{\theta} E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]
$$

and denote $$J(\theta)$$ as

$$
J(\theta)=E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]
$$

which is called the objective.

## Evaluating the objective

We use the Monte Carlo method: sample trajectories and estimate $$J(\theta)$$ by averaging their total rewards,

$$
J(\theta)\approx \frac{1}{N}\sum_i\sum_t r(s_{i,t},a_{i,t})
$$

Each sample (each $$i$$) is a trajectory over time $$t$$ rolled out from $$\pi_\theta$$.

![Evaluate objective](/files/-LjVHpxP4lm9mHZsyfrc)
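
As a concrete sketch (`sample_trajectory` is a hypothetical helper that rolls out the current policy once and returns its per-step rewards), the estimator is just an average of trajectory returns:

```python
import numpy as np

def estimate_objective(sample_trajectory, policy_params, num_trajectories=100):
    """Monte Carlo estimate of J(theta): average total reward over N rollouts."""
    returns = []
    for _ in range(num_trajectories):
        rewards = sample_trajectory(policy_params)  # one rollout under pi_theta
        returns.append(sum(rewards))                # sum_t r(s_{i,t}, a_{i,t})
    return np.mean(returns)                         # (1/N) sum_i sum_t r(...)
```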

## Direct policy gradient

We want to improve the objective $$J(\theta)$$, so we can simply take its derivative with respect to $$\theta$$ and use gradient ascent. Denote $$r(\tau)=\sum_{t=1}^T r(s_t,a_t)$$, so that

$$
J(\theta)=E_{\tau\sim \pi_\theta (\tau)}[r(\tau)]=\int \pi_\theta (\tau)r(\tau)\,d\tau
$$

So the derivative of $$J(\theta)$$ with respect to $$\theta$$ is

$$
\begin{aligned}
\nabla_\theta J(\theta)
&=\int \nabla_\theta \pi_\theta(\tau)\,r(\tau)\,d\tau \\
&=\int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\,d\tau \\
&=E_{\tau\sim \pi_\theta (\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right]
\end{aligned}
$$

The second equality uses a convenient identity (the log trick):

$$
\pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) =\pi_\theta(\tau) \frac{\nabla_\theta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\nabla_\theta\pi_\theta(\tau)
$$

From the above, $$J(\theta)$$ is the expectation of $$r(\tau)$$, and $$\nabla_\theta J(\theta)$$ is the expectation of $$r(\tau)$$ weighted by $$\nabla_\theta \log \pi_\theta(\tau)$$, where $$\tau$$ follows $$\pi_\theta(\tau)$$.
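
As a sanity check of the log trick (a toy example of my own, not from the source): for a three-outcome softmax "policy" with fixed per-outcome rewards, the score-function estimate should closely match the analytic gradient of the expected reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])   # logits of a categorical "policy"
r = np.array([1.0, 3.0, 0.0])        # fixed "reward" for each outcome

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p = softmax(theta)

# Analytic gradient of J(theta) = sum_a p(a) r(a), using dp_a/dtheta_k = p_a(1{a=k} - p_k)
analytic = np.array([sum(p[a] * ((a == k) - p[k]) * r[a] for a in range(3))
                     for k in range(3)])

# Score-function (log trick) estimate: E[ grad_theta log p(a) * r(a) ]
samples = rng.choice(3, size=200_000, p=p)
grad_log = np.eye(3)[samples] - p    # grad_theta log p(a) for softmax logits
score_est = (grad_log * r[samples, None]).mean(axis=0)

print(analytic, score_est)           # the two should closely agree
```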

We already know $$r(\tau)=\sum_{t=1}^T r(s_t,a_t)$$, and what is $$\nabla_\theta \log \pi_\theta(\tau)$$?

$$
\begin{aligned}
\pi_\theta(\tau)&= p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t) \\
\log \pi_\theta(\tau) &= \log p(s_1) +\sum_{t=1}^T \left[\log\pi_\theta (a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\right]\\
\nabla_\theta \log \pi_\theta(\tau) &= \nabla_\theta \left[\log p(s_1)  +\sum_{t=1}^T \left[\log\pi_\theta (a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\right]\right]\\
&=\sum_{t=1}^T\nabla_\theta \log\pi_\theta (a_t|s_t)
\end{aligned}
$$

The last equality holds because the initial-state and transition terms do not depend on $$\theta$$. It is worth mentioning that the log trick turns the product $$\prod$$ into a sum $$\sum$$, which is friendly to $$\nabla$$ and easy to estimate.
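
In code, this means $$\nabla_\theta \log \pi_\theta(\tau)$$ is computed from the per-step action log-probabilities alone; the dynamics terms never appear. A PyTorch sketch with a toy linear policy (assumed for illustration):

```python
import torch

logits_net = torch.nn.Linear(4, 2)   # toy policy: 4-dim state -> 2 action logits

def trajectory_log_prob(states, actions):
    # sum_t log pi_theta(a_t|s_t); p(s_1) and p(s_{t+1}|s_t,a_t) are constants w.r.t. theta
    dist = torch.distributions.Categorical(logits=logits_net(states))
    return dist.log_prob(actions).sum()

states = torch.randn(10, 4)              # a dummy trajectory with T = 10
actions = torch.randint(0, 2, (10,))
trajectory_log_prob(states, actions).backward()
# logits_net's .grad fields now hold sum_t grad_theta log pi_theta(a_t|s_t)
```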

Finally,

$$
\nabla_\theta J(\theta)
=E_{\tau\sim \pi_\theta (\tau)} \left[\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_t|s_t)\right)
\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]
$$

Notice that this expression needs neither the transition probabilities nor the initial-state distribution: we can simply sample from the environment without knowing the dynamics of the system. Besides, the policy $$\pi_\theta$$ itself can be customized.

## Evaluating the policy gradient

In practice, we can use $$N$$ trajectory samples and take the average.

$$
\nabla_\theta J(\theta)
\approx\frac{1}{N}\sum_{i=1}^N
\left[
\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t})\right)
\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)
\right]
$$

Naturally, we obtain the following algorithm.

> REINFORCE algorithm:
>
> repeat until convergence:
>
> 1. sample $$\{\tau^i\}$$ from $$\pi_\theta (a_t|s_t)$$ (run the current policy)
> 2. $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{t}^i|s_{t}^i) \right) \left(\sum_{t=1}^T r(s_{t}^i,a_{t}^i) \right) \right]$$
> 3. $$\theta\leftarrow \theta+ \alpha \nabla_\theta J(\theta)$$

![REINFORCE algorithm](/files/-LjWMv2HJ8Dgk9qqNtqn)
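
Below is a compact sketch of one REINFORCE update in PyTorch. The environment interface (`reset()` returning a state tensor, `step(action)` returning `(next_state, reward, done)`) and `policy_net` (a network mapping states to action logits) are illustrative assumptions, not part of the source.

```python
import torch

def reinforce_step(policy_net, optimizer, env, num_trajectories=10, horizon=200):
    """One REINFORCE update: sample N trajectories, estimate grad J, ascend."""
    loss = 0.0
    for _ in range(num_trajectories):
        state, log_probs, rewards = env.reset(), [], []
        for _ in range(horizon):
            dist = torch.distributions.Categorical(logits=policy_net(state))
            action = dist.sample()                        # step 1: run the current policy
            state, reward, done = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
            if done:
                break
        # contribution of trajectory i: (sum_t log pi) * (sum_t r); negated so that
        # minimizing the loss performs gradient ascent on J(theta)
        loss = loss - torch.stack(log_probs).sum() * sum(rewards)
    optimizer.zero_grad()
    (loss / num_trajectories).backward()                  # step 2: estimate grad_theta J
    optimizer.step()                                      # step 3: theta <- theta + alpha * grad
```

Calling `reinforce_step` in a loop with, e.g., `torch.optim.SGD(policy_net.parameters(), lr=alpha)` reproduces the repeat-until-convergence structure above.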

## Continuous actions: Gaussian policies

For continuous actions, we can use a Gaussian policy:

$$
\begin{aligned}
&\pi_\theta(a_t|s_t)=\mathcal{N}\big(f_{\text{neural network}}(s_t),\Sigma\big)    \\
&\log \pi_\theta(a_t|s_t) =-\frac{1}{2}\|f(s_t)-a_t\|^2_\Sigma +\text{const}    \\
&\nabla_{\theta}\log\pi_\theta(a_t|s_t) =-\Sigma^{-1}(f(s_t)-a_t)\frac{df}{d\theta}
\end{aligned}
$$
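
In practice the gradient above comes from automatic differentiation rather than the closed form. A minimal sketch with a diagonal covariance (`mean_net` as $$f$$ and a learned `log_std` are assumed parameterizations):

```python
import torch

mean_net = torch.nn.Linear(3, 2)               # f(s_t): 3-dim state -> 2-dim action mean
log_std = torch.zeros(2, requires_grad=True)   # diagonal Sigma, parameterized by log std

def gaussian_log_prob(state, action):
    # log N(a_t; f(s_t), Sigma), summed over action dimensions
    dist = torch.distributions.Normal(mean_net(state), log_std.exp())
    return dist.log_prob(action).sum(-1)

state, action = torch.randn(3), torch.randn(2)
gaussian_log_prob(state, action).backward()
# mean_net's gradients realize -Sigma^{-1}(f(s_t)-a_t) df/dtheta via autodiff
```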

## Partial observability

![partial observability](/files/-LjWMv2JokXdOyYBBXCt)

$$
\nabla_\theta J(\theta)
\approx\frac{1}{N}\sum_{i=1}^N
\left[
\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|o_{i,t})\right)
\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)
\right]
$$

Notice that the Markov property is never actually used in the derivation, so we can use the policy gradient in partially observed MDPs without modification: the policy simply conditions on observations $$o_t$$ instead of states $$s_t$$.
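
A sketch of how little changes in code (the observation dimensionality and the linear policy are assumptions): the policy network just consumes $$o_t$$.

```python
import torch

# POMDP case: the estimator is unchanged; the policy conditions on the
# observation o_t instead of the underlying state s_t.
obs_policy = torch.nn.Linear(8, 2)   # assumed: 8-dim observation -> 2 action logits

def action_log_prob(observation, action):
    dist = torch.distributions.Categorical(logits=obs_policy(observation))
    return dist.log_prob(action)     # plays the role of log pi_theta(a_t | o_t)
```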

