# Reduce variance

## What's wrong with the policy gradient?

![Reduce variance](/files/-LjZ5jZSDU3iYQkrVNZW)

We know that the plain policy-gradient estimator has high variance, and there are many causes. Here is the most straightforward one.

Suppose we get one large negative reward and two small positive rewards. According to the update formula for $$\theta$$, the distribution of $$\tau$$ moves far to the right. However, if we add a constant to every reward, so that we now have one small positive reward and two large positive rewards, the distribution of $$\tau$$ moves to the right only a little.

From the above illustration, a slight change to the rewards severely influences the value of $$\nabla_\theta J(\theta)$$, which causes high variance.

The worst case is when the two good samples have $$r(\tau)=0$$: their gradient contributions vanish, so training may take a long time to converge or end in a sub-optimal solution.
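The sensitivity to an additive reward shift can be seen numerically. The following is a minimal sketch with made-up numbers: `grad_log_prob` stands in for hypothetical per-trajectory score gradients $$\nabla_\theta\log\pi_\theta(\tau_i)$$ of a 1-D parameter; only the arithmetic of the estimator is real.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical score gradients grad_theta log pi(tau_i) for a 1-D
# parameter; the exact values are illustrative only.
grad_log_prob = rng.normal(size=3)

rewards = np.array([-10.0, 1.0, 1.0])  # one large negative, two small positive
shifted = rewards + 10.0               # add a constant to every reward

# Monte Carlo policy-gradient estimate: (1/N) sum_i grad log pi(tau_i) r(tau_i)
grad_estimate = np.mean(grad_log_prob * rewards)
grad_estimate_shifted = np.mean(grad_log_prob * shifted)

# The two estimates differ even though an additive constant does not
# change which trajectories are better -- a symptom of high variance.
print(grad_estimate, grad_estimate_shifted)
```

The gap between the two estimates is exactly the constant times the mean score gradient, which only vanishes in expectation, not for a finite sample.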

## Causality

$$
\nabla_\theta J(\theta)
=\frac{1}{N}\sum_{i=1}^N
\left[
\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t})\right)
\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)
\right]
$$

Causality means that the policy at time $$t'$$ cannot affect the reward at time $$t$$ when $$t<t'$$. So the gradient should be

$$
\nabla_\theta J(\theta)
=\frac{1}{N}\sum_{i=1}^N
\left[
\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t})
\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)
\right]
=\frac{1}{N}\sum_{i=1}^N
\left[
\sum_{t=1}^T \nabla_\theta \log\pi_\theta (a_{i,t}|s_{i,t})
\hat{Q}_{i,t}
\right]
$$

Here $$\hat{Q}_{i,t}$$ is the reward-to-go from time $$t$$ for sample $$i$$. Since the inner sum now starts at $$t'=t$$ instead of $$t'=1$$, fewer reward terms are summed, which leads to lower variance.
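The reward-to-go $$\hat{Q}_{i,t}$$ for one trajectory can be computed in one pass as a reverse cumulative sum; this is a small sketch, with the function name `reward_to_go` chosen for illustration:

```python
import numpy as np

def reward_to_go(rewards):
    """Compute Q_hat_t = sum over t' from t to T of r(s_t', a_t')
    for a single trajectory, given a 1-D array of per-step rewards."""
    # Cumulative sum over the reversed rewards, reversed back:
    # entry t holds the sum of rewards from t to the end.
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards [1, 2, 3] give rewards-to-go [6, 5, 3].
print(reward_to_go(np.array([1.0, 2.0, 3.0])))
```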

Causality always holds, so this trick can always be applied.

## Baseline

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)r(\tau_i)
$$

Our purpose is to make good trajectories more probable and bad trajectories less probable. The problem is that good trajectories don't always have large positive rewards, and bad trajectories aren't always negative. Our method is to subtract a baseline, as follows:

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i)(r(\tau_i)-b)
$$

where $$b=\frac{1}{N}\sum_{i=1}^N r(\tau_i)$$.

In fact, subtracting a baseline leaves the estimator unbiased in expectation:

$$
E[\nabla_\theta\log\pi_\theta(\tau)b]=\int \pi_\theta(\tau)\nabla_\theta\log\pi_\theta(\tau)b\,d\tau=\int \nabla_\theta \pi_\theta(\tau) b\,d\tau=b\nabla_\theta\int \pi_\theta(\tau)\,d\tau=b\nabla_\theta 1=0
$$

The last thing worth mentioning is that the average reward is not the best baseline, but it's pretty good.
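The baseline-subtracted estimator can be sketched in a few lines. This is a minimal illustration assuming a 1-D parameter and per-trajectory returns; the function name `pg_estimate` is made up for this example:

```python
import numpy as np

def pg_estimate(grad_log_probs, returns, use_baseline=True):
    """Policy-gradient estimate over N sampled trajectories.

    grad_log_probs: shape (N,), grad_theta log pi(tau_i) for a 1-D
    parameter (illustrative). returns: shape (N,), r(tau_i).
    """
    # Average-reward baseline b = (1/N) sum_i r(tau_i), or zero.
    b = returns.mean() if use_baseline else 0.0
    return np.mean(grad_log_probs * (returns - b))
```

For example, with constant score gradients `np.ones(3)` and returns `[1, 2, 3]`, the estimate is `2.0` without the baseline and `0.0` with it; the baseline changes the sample value but not the expectation.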

## Analyzing variance

Subtracting a baseline is unbiased, but can we find the best baseline, the one with the lowest variance?

$$
\nabla_\theta J(\theta)=E_{\tau\sim \pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b)]
$$

And the variance

$$
\begin{aligned}
\text{Var}
&= E[x^2]-E[x]^2    \\
&=E_{\tau\sim \pi_\theta(\tau)}[(\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b))^2]-E_{\tau\sim \pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b)]^2    \\
&=E_{\tau\sim \pi_\theta(\tau)}[(\nabla_\theta\log\pi_\theta (\tau)(r(\tau)-b))^2]-E_{\tau\sim \pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta (\tau)r(\tau)]^2
\end{aligned}
$$

The last step holds because the baseline is unbiased in expectation, so it drops out of the $$E[x]^2$$ term.

Denote $$g(\tau)=\nabla_\theta\log\pi_\theta (\tau)$$ and minimize the variance with respect to $$b$$:

$$
\begin{aligned}
\frac{d\text{Var}}{db}
&=\frac{d}{db}E[g(\tau)^2(r(\tau)-b)^2]    \\
&=\frac{d}{db}\left(E[g(\tau)^2r(\tau)^2]-2bE[g(\tau)^2r(\tau)]+b^2E[g(\tau)^2]\right)    \\
&=\frac{d}{db}\left(-2bE[g(\tau)^2r(\tau)]+b^2E[g(\tau)^2]\right)    \\
&=-2E[g(\tau)^2r(\tau)]+2bE[g(\tau)^2]
=0
\end{aligned}
$$

So $$b$$ should be

$$
b=\frac{E[g(\tau)^2r(\tau)]}{E[g(\tau)^2]}
$$

This is just the expected reward, but weighted by squared gradient magnitudes.

In theory, the best baseline is this gradient-weighted expected reward, but in practice the difference between it and the plain average reward is usually negligible. Since the average is cheaper to compute, we just use the average reward as the baseline.
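The two baselines can be compared numerically on synthetic data. This sketch uses made-up per-trajectory values for $$g(\tau)$$ and $$r(\tau)$$ (a 1-D parameter, returns deliberately correlated with the gradient magnitude so the baselines differ) and compares $$E[g(\tau)^2(r(\tau)-b)^2]$$, the only $$b$$-dependent part of the variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-trajectory quantities (illustrative values only):
g = rng.normal(size=1000)               # grad_theta log pi(tau_i), 1-D parameter
r = 5.0 + g**2 + rng.normal(size=1000)  # returns correlated with |gradient|

b_avg = r.mean()                           # average-reward baseline
b_opt = np.mean(g**2 * r) / np.mean(g**2)  # variance-optimal baseline

# Compare the b-dependent part of the variance, E[g^2 (r - b)^2].
second_moment_avg = np.mean((g * (r - b_avg))**2)
second_moment_opt = np.mean((g * (r - b_opt))**2)

print(second_moment_avg, second_moment_opt)
```

Since $$b$$ enters this quantity quadratically, the optimal baseline minimizes it exactly, and the average-reward baseline can never do better; in typical runs the gap is small, matching the practical observation above.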

