Policy gradient is on-policy
$$\nabla_\theta J(\theta)=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_\theta\log\pi_\theta(\tau)\,r(\tau)\right]$$

This formula takes the expectation of $\nabla_\theta\log\pi_\theta(\tau)\,r(\tau)$ under the same policy $\pi_\theta$ that we are updating, which is a big problem: every time the policy network changes even a little, the old samples have to be discarded and new samples must be collected under the new policy. On-policy learning can therefore be extremely inefficient.
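For concreteness, here is a minimal sketch of the corresponding on-policy estimator, assuming a hypothetical discrete-action PyTorch policy `policy_net` that maps states to action logits; the function name and data layout are illustrative, not a reference implementation.

```python
# Minimal on-policy REINFORCE sketch (illustrative; assumes a discrete-action
# policy network `policy_net` mapping states to action logits).
import torch
from torch.distributions import Categorical

def reinforce_loss(policy_net, trajectories):
    """Surrogate loss whose gradient is -E_tau[ sum_t grad log pi_theta(a_t|s_t) * r(tau) ].

    Each trajectory is a dict with:
      'states'  : FloatTensor [T, state_dim]
      'actions' : LongTensor  [T]
      'rewards' : FloatTensor [T]
    and must be sampled from the *current* policy pi_theta.
    """
    losses = []
    for traj in trajectories:
        logits = policy_net(traj['states'])                                # [T, n_actions]
        log_probs = Categorical(logits=logits).log_prob(traj['actions'])   # [T]
        traj_return = traj['rewards'].sum()                                # r(tau), a constant w.r.t. theta
        losses.append(-(log_probs.sum() * traj_return))
    return torch.stack(losses).mean()
```

As soon as the parameters are updated, these trajectories no longer come from the current policy, so they cannot be reused for the next gradient step.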
Importance sampling & off-policy learning
We need to rewrite the expectation under the current policy as an expectation under some other (old) policy, and importance sampling does exactly that.
$$\begin{aligned}
E_{x\sim p(x)}[f(x)]
&= \int p(x)f(x)\,dx \\
&= \int \frac{q(x)}{q(x)}p(x)f(x)\,dx \\
&= \int q(x)\frac{p(x)}{q(x)}f(x)\,dx \\
&= E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]
\end{aligned}$$
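A quick numerical sanity check of this identity (a toy sketch: $p=\mathcal{N}(1,1)$, $q=\mathcal{N}(0,1)$ and $f(x)=x^2$ are chosen purely for illustration).

```python
# Toy importance-sampling check: estimate E_{x~p}[f(x)] using samples drawn
# from a different distribution q, reweighted by p(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
mu_p, mu_q = 1.0, 0.0                          # p = N(1, 1), q = N(0, 1)

x_p = rng.normal(mu_p, 1.0, size=100_000)      # samples from p (direct estimate)
x_q = rng.normal(mu_q, 1.0, size=100_000)      # samples from q (what we actually have)

direct = np.mean(f(x_p))                                        # E_{x~p}[f(x)]
w = normal_pdf(x_q, mu_p, 1.0) / normal_pdf(x_q, mu_q, 1.0)     # importance weights p(x)/q(x)
reweighted = np.mean(w * f(x_q))                                # E_{x~q}[(p/q) f(x)]
print(direct, reweighted)                      # both approach E_p[x^2] = 1^2 + 1 = 2
```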
Now suppose we don't have samples from the current policy $\pi_\theta(\tau)$, but we do have samples from some other policy $\bar{\pi}(\tau)$. Then

$$J(\theta)
=E_{\tau\sim\pi_{\theta}(\tau)}[r(\tau)]
=E_{\tau\sim \bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]$$

But we can't evaluate the trajectory distributions $\pi_\theta(\tau)$ and $\bar{\pi}(\tau)$ themselves, because they contain the unknown transition probabilities:
$$\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}
=\frac{p(s_1)\prod_{t=1}^T \pi_\theta (a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}
=\frac{\prod_{t=1}^T \pi_\theta (a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$$

The transition probabilities cancel, so we never need to know the dynamics: the two policies are all we have to evaluate.
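In practice this ratio is computed in log space from the stored per-step action log-probabilities, so the dynamics never appear. A minimal sketch (array names are illustrative):

```python
# Trajectory importance weight pi_theta(tau) / pi_bar(tau), computed from
# per-step action log-probabilities only (the dynamics terms have cancelled).
import numpy as np

def trajectory_weight(logp_theta, logp_bar):
    """logp_theta, logp_bar: arrays of shape [T] holding log pi(a_t|s_t)
    under the two policies for one stored trajectory."""
    # prod_t pi_theta(a_t|s_t)/pi_bar(a_t|s_t) = exp( sum_t (log pi_theta - log pi_bar) )
    return np.exp(np.sum(logp_theta - logp_bar))
```

Working in log space avoids explicitly multiplying many small probabilities together.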
Deriving the policy gradient with IS
Can we estimate the objective for some new parameters $\theta'$ using samples collected under the old parameters $\theta$?
$$J(\theta')
=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]$$

Since $\pi_{\theta'}(\tau)$ is the only bit that depends on $\theta'$, we have
$$\nabla_{\theta'} J(\theta')
=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\nabla_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]
=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right]$$

The off-policy policy gradient
Locally, $\theta=\theta'$
$$\begin{aligned}
\nabla_{\theta'} J(\theta')
&=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right] \\
&=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)\,r(\tau)\right] \\
&=\nabla_{\theta} J(\theta)
\end{aligned}$$

When $\theta'=\theta$ the importance weight is exactly $1$, so this is the same as the on-policy policy gradient.
Globally, $\theta\ne\theta'$
$$\begin{aligned}
\nabla_{\theta'} J(\theta')
&=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right] \\
&=E_{\tau\sim \pi_\theta(\tau)}
\left[
\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\right)
\left(\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)\right)
\left(\sum_{t=1}^T r(s_t,a_t)\right)
\right]
\end{aligned}$$
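Here is a sketch of a surrogate loss whose gradient matches this expression, reusing the hypothetical `policy_net` interface from above and assuming each stored trajectory keeps the per-step log-probabilities recorded under the old policy $\pi_\theta$. The trajectory ratio is detached from the computation graph, so differentiating the surrogate with respect to $\theta'$ reproduces the trajectory ratio times $\sum_t\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)$ times the return.

```python
# Off-policy policy-gradient surrogate for the full-trajectory formula above.
# Illustrative sketch: `old_log_probs` were recorded when tau was sampled from
# pi_theta, and `policy_net` now holds the new parameters theta'.
import torch
from torch.distributions import Categorical

def off_policy_pg_loss(policy_net, trajectories):
    losses = []
    for traj in trajectories:
        logits = policy_net(traj['states'])                                   # [T, n_actions]
        new_log_probs = Categorical(logits=logits).log_prob(traj['actions'])  # log pi_theta'(a_t|s_t)
        # prod_t pi_theta'(a_t|s_t) / pi_theta(a_t|s_t), held constant w.r.t. theta'
        ratio = torch.exp((new_log_probs - traj['old_log_probs']).sum()).detach()
        traj_return = traj['rewards'].sum()
        # gradient of this term = ratio * sum_t grad log pi_theta'(a_t|s_t) * r(tau)
        losses.append(-(ratio * new_log_probs.sum() * traj_return))
    return torch.stack(losses).mean()
```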
Consider causality:

$$\begin{aligned}
\nabla_{\theta'} J(\theta')
&=E_{\tau\sim \pi_\theta(\tau)}
\left[
\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)
\left(\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}\right)
\left(\sum_{t'=t}^T r(s_{t'},a_{t'}) \left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_{\theta}(a_{t''}|s_{t''})}\right)\right)
\right]
\end{aligned}$$

The first product says that future actions don't affect the importance weight at the current timestep. The second sum is the reward to go. The second product accounts for the probability of getting from the current timestep $t$ to the future timestep $t'$; if we ignore it, we get a policy iteration algorithm.
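To make the indexing concrete, here is an illustrative sketch (written for clarity rather than efficiency) of the per-timestep coefficient that multiplies $\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)$ in this estimator, given the per-step ratios and rewards of one trajectory; the names are assumptions, not part of any library.

```python
# Per-timestep coefficients for the causal, off-policy gradient estimator above.
# `ratios[t]` holds pi_theta'(a_t|s_t) / pi_theta(a_t|s_t) for one trajectory.
import numpy as np

def causal_coefficients(ratios, rewards):
    """coef[t] = (prod_{t'<=t} ratio_{t'}) * sum_{t'>=t} r_{t'} * prod_{t''=t..t'} ratio_{t''},
    i.e. the factor that multiplies grad log pi_theta'(a_t|s_t)."""
    T = len(rewards)
    coef = np.zeros(T)
    for t in range(T):
        past_weight = np.prod(ratios[: t + 1])        # first product: weights up to time t
        future = 0.0
        for tp in range(t, T):
            # reward-to-go, reweighted by the probability of reaching t' from t;
            # dropping np.prod(ratios[t:tp + 1]) gives the simpler policy-iteration-style variant
            future += rewards[tp] * np.prod(ratios[t : tp + 1])
        coef[t] = past_weight * future
    return coef
```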
A first-order approximation
$\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}$ is exponential in $T$, and it will tend to $0$ or $\infty$ as the horizon grows, which is a big problem for the variance of the estimator. A first-order approximation keeps only the current timestep's ratio $\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}$ instead of the full product; this problem, and why the approximation is reasonable, will be discussed in depth in Advanced Policy Gradient.
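A toy numerical illustration of the degeneration: the per-step ratios below are simulated as log-normal variables with mean $1$, standing in for $\pi_{\theta'}(a_t|s_t)/\pi_\theta(a_t|s_t)$.

```python
# Why the product of per-step ratios degenerates: even if each ratio averages 1,
# the product over T steps is exponential in T, so typical trajectory weights
# collapse toward 0 while rare ones explode.
import numpy as np

rng = np.random.default_rng(0)
for T in (10, 100, 1000):
    # 5 simulated trajectories of per-step log-ratios with E[ratio] = 1
    log_ratios = rng.normal(loc=-0.005, scale=0.1, size=(5, T))
    products = np.exp(log_ratios.sum(axis=1))    # full-trajectory importance weights
    print(T, products)
```

As $T$ grows the weights concentrate on a handful of trajectories, so the variance of the estimator blows up.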