# Off-policy version PG

## Policy gradient is on-policy

$$
\nabla\_\theta J(\theta)=E\_{\tau\sim \pi\_\theta(\tau)}\[\nabla\_\theta\log\pi\_\theta (\tau)r(\tau)]
$$

This formula computes the expectation of $$\nabla\_\theta\log\pi\_\theta (\tau)r(\tau)$$ under the very policy $$\pi\_\theta$$ being optimized, which is a big problem: every time the policy network changes even a little, the old samples must be discarded and new samples must be collected under the new policy. On-policy learning can therefore be extremely sample-inefficient.

## Importance sampling & off-policy learning

We need to rewrite the expectation under the new policy as an expectation under the old policy, and importance sampling does exactly that.

$$
\begin{aligned}
E\_{x\sim p(x)}\[f(x)]
&= \int p(x)f(x)dx \\
&=\int \frac{q(x)}{q(x)}p(x)f(x)dx \\
&=\int q(x)\frac{p(x)}{q(x)}f(x)dx \\
&=E\_{x\sim q(x)}\left\[\frac{p(x)}{q(x)}f(x)\right]
\end{aligned}
$$
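The identity above can be checked numerically. Below is a minimal NumPy sketch with an assumed target $$p=\mathcal{N}(1,1)$$, proposal $$q=\mathcal{N}(0,1)$$, and $$f(x)=x^2$$ (all illustrative choices), so the true value is $$\mu^2+\sigma^2=2$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x) = N(1, 1); proposal q(x) = N(0, 1). Estimate E_{x~p}[f(x)]
# for f(x) = x**2 using only samples from q, reweighted by p(x)/q(x).
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
x_q = rng.normal(0.0, 1.0, size=200_000)                    # samples from q
w = normal_pdf(x_q, 1.0, 1.0) / normal_pdf(x_q, 0.0, 1.0)   # importance weights
is_estimate = np.mean(w * f(x_q))

# Ground truth: E[x^2] under N(1, 1) is mu^2 + sigma^2 = 2.
print(is_estimate)
```

The estimate is unbiased, but its variance depends on how far $$q$$ is from $$p$$; with these nearby Gaussians the weights stay well behaved.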

Now suppose we don't have samples from the current policy $$\pi\_\theta(\tau)$$, but we do have samples from some other policy $$\bar{\pi}(\tau)$$ instead.

$$
J(\theta)
\=E\_{\tau\sim\pi\_{\theta}(\tau)}\[r(\tau)]
\=E\_{\tau\sim \bar{\pi}(\tau)}\left\[\frac{\pi\_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)    \right]
$$

But we don't know the trajectory distributions themselves. Expanding both of them:

$$
\frac{\pi\_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s\_1)\prod\_{t=1}^T \pi\_\theta (a\_t|s\_t)p(s\_{t+1}|s\_t,a\_t)}{p(s\_1)\prod\_{t=1}^T \bar{\pi}(a\_t|s\_t)p(s\_{t+1}|s\_t,a\_t)}=\frac{\prod\_{t=1}^T \pi\_\theta (a\_t|s\_t)}{\prod\_{t=1}^T \bar{\pi}(a\_t|s\_t)}
$$

The initial-state and transition probabilities cancel, so we don't need to know them; the two policies are all we need.
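The cancellation above means the trajectory weight is just a product of per-step policy ratios. A small sketch, with hypothetical per-step action probabilities, computed in log space to avoid underflow over long horizons:

```python
import numpy as np

# Per-step action probabilities along one sampled trajectory (hypothetical
# numbers): pi_new[t] = pi_theta(a_t|s_t), pi_old[t] = pi_bar(a_t|s_t).
pi_new = np.array([0.7, 0.5, 0.9, 0.6])
pi_old = np.array([0.6, 0.5, 0.8, 0.7])

# The transition terms p(s_{t+1}|s_t, a_t) cancel, so the trajectory ratio
# is just the product of per-step policy ratios. Summing logs instead of
# multiplying probabilities is numerically safer for long trajectories.
log_ratio = np.sum(np.log(pi_new) - np.log(pi_old))
traj_weight = np.exp(log_ratio)
print(traj_weight)
```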

## Deriving the policy gradient with IS

Can we estimate the value of some new parameters $$\theta'$$ under old parameters $$\theta$$ ?

$$
J(\theta')
\=E\_{\tau\sim \pi\_\theta(\tau)}\left\[\frac{\pi\_{\theta'}(\tau)}{\pi\_\theta(\tau)}r(\tau)    \right]
$$

Since $$\pi\_{\theta'}(\tau)$$ is the only factor that depends on $$\theta'$$, we have

$$
\nabla\_{\theta'} J(\theta')
\=E\_{\tau\sim \pi\_\theta(\tau)}\left\[\frac{\nabla\_{\theta'}\pi\_{\theta'}(\tau)}{\pi\_\theta(\tau)}r(\tau)    \right]
\=E\_{\tau\sim \pi\_\theta(\tau)}\left\[\frac{\pi\_{\theta'}(\tau)}{\pi\_\theta(\tau)}\nabla\_{\theta'}\log\pi\_{\theta'}(\tau) r(\tau)    \right]
$$
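This estimator can be sanity-checked on a toy problem. Below, a two-action bandit (a hypothetical stand-in for full trajectories) with softmax policies: samples come from $$\pi\_\theta$$, the gradient is taken at $$\theta'$$, and the Monte Carlo estimate is compared against the analytic $$\nabla\_{\theta'}J(\theta')$$. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

r = np.array([1.0, 0.0])            # reward of each action
theta = np.array([0.0, 0.0])        # behavior (sampling) parameters
theta_p = np.array([0.5, -0.5])     # parameters being evaluated

pi = softmax(theta)                 # pi_theta
pi_p = softmax(theta_p)             # pi_theta'

actions = rng.choice(2, size=100_000, p=pi)

# For a softmax policy, grad_{theta'} log pi_theta'(a) = one_hot(a) - pi_p.
one_hot = np.eye(2)[actions]
score = one_hot - pi_p
w = (pi_p / pi)[actions]            # importance weights pi_theta'(a)/pi_theta(a)

grad_est = np.mean(w[:, None] * score * r[actions][:, None], axis=0)

# Exact gradient of J(theta') = sum_a pi_theta'(a) r(a) for comparison.
grad_true = pi_p * (r - np.sum(pi_p * r))
print(grad_est, grad_true)
```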

## The off-policy policy gradient

Locally, $$\theta=\theta'$$

$$
\nabla\_{\theta'} J(\theta')
\=E\_{\tau\sim \pi\_\theta(\tau)}\left\[\nabla\_{\theta'}\log\pi\_{\theta'}(\tau) r(\tau)    \right]
\=E\_{\tau\sim \pi\_\theta(\tau)}\left\[\nabla\_{\theta}\log\pi\_{\theta}(\tau) r(\tau)    \right]
\=\nabla\_{\theta} J(\theta)
$$

This is the same as the on-policy policy gradient: when $$\theta=\theta'$$ the importance weight is identically $$1$$.

Globally, $$\theta\ne\theta'$$

$$
\begin{aligned}
\nabla\_{\theta'} J(\theta')
&=E\_{\tau\sim \pi\_\theta(\tau)}\left\[\frac{\pi\_{\theta'}(\tau)}{\pi\_\theta(\tau)}\nabla\_{\theta'}\log\pi\_{\theta'}(\tau) r(\tau)    \right] \\
&=E\_{\tau\sim \pi\_\theta(\tau)}
\left\[
\left(\prod\_{t=1}^T\frac{\pi\_{\theta'}(a\_t|s\_t)}{\pi\_{\theta}(a\_t|s\_t)}    \right)\
\left(\sum\_{t=1}^T\nabla\_{\theta'}\log\pi\_{\theta'}(a\_t|s\_t)     \right)
\left(\sum\_{t=1}^T r(s\_t,a\_t)\right)\
\right] \\
\end{aligned}
$$

Taking causality into account:

$$
\begin{aligned}
\nabla\_{\theta'} J(\theta')
&=E\_{\tau\sim \pi\_\theta(\tau)}
\left\[
\sum\_{t=1}^T\nabla\_{\theta'}\log\pi\_{\theta'}(a\_t|s\_t)
\left(\prod\_{t'=1}^t \frac{\pi\_{\theta'}(a\_{t'}|s\_{t'})}{\pi\_{\theta}(a\_{t'}|s\_{t'})}    \right)\
\left(\sum\_{t'=t}^T r(s\_{t'},a\_{t'}) \left(\prod\_{t''=t}^{t'} \frac{\pi\_{\theta'}(a\_{t''}|s\_{t''})}{\pi\_{\theta}(a\_{t''}|s\_{t''})}    \right)        \right)
\right]     \\
\end{aligned}
$$

The first $$\Pi$$ says that future actions don't affect the weight at the current step $$t$$. The second $$\Sigma$$ is the reward to go. The second $$\Pi$$ is the probability ratio from the current step $$t$$ to the future step $$t'$$; if we ignore it, we get a policy iteration algorithm.
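A small sketch of the coefficient multiplying $$\nabla\_{\theta'}\log\pi\_{\theta'}(a\_t|s\_t)$$ at each step, following the equation above literally (0-indexed; the ratio and reward arrays are hypothetical). As a sanity check, when $$\theta'=\theta$$ every ratio is $$1$$ and the coefficients reduce to the ordinary reward to go:

```python
import numpy as np

def causal_is_coeffs(rho, r):
    """Coefficient of grad log pi_theta'(a_t|s_t) for each step t, where
    rho[t] = pi_theta'(a_t|s_t) / pi_theta(a_t|s_t) and r[t] is the reward."""
    T = len(rho)
    coef = np.zeros(T)
    for t in range(T):
        past = np.prod(rho[: t + 1])                   # prod over t' = 1..t
        togo = sum(r[tp] * np.prod(rho[t : tp + 1])    # prod over t'' = t..t'
                   for tp in range(t, T))
        coef[t] = past * togo
    return coef

r = np.array([1.0, 0.5, 0.0, 2.0])

# On-policy check: with all ratios equal to 1 the coefficients are exactly
# the reward-to-go sums sum_{t' >= t} r[t'].
on_policy = causal_is_coeffs(np.ones(4), r)
print(on_policy)
```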

## A first-order approximation

The weight $$\prod\_{t'=1}^t \frac{\pi\_{\theta'}(a\_{t'}|s\_{t'})}{\pi\_{\theta}(a\_{t'}|s\_{t'})}$$ is a product of up to $$T$$ terms, so it is exponential in $$T$$ and tends to $$0$$ or $$\infty$$, which is a big problem. This problem will be discussed in depth in Advanced Policy Gradient.
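The blow-up is easy to demonstrate with made-up per-step ratios. Alternating ratios of $$0.8$$ and $$1.2$$ average $$1$$ arithmetically, yet each pair multiplies the running product by $$0.96$$, so the trajectory weight decays exponentially with the horizon $$T$$:

```python
import numpy as np

# Alternate per-step ratios 0.8 and 1.2: the arithmetic mean of each pair
# is 1, but their product is 0.96, so the trajectory weight shrinks
# geometrically as the horizon grows.
for T in [10, 100, 1000]:
    weight = np.prod(np.tile([0.8, 1.2], T // 2))
    print(T, weight)
```

Ratios biased the other way explode toward $$\infty$$ just as fast, which is why long-horizon importance weights are so high-variance.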
