Off-policy version PG

Policy gradient is on-policy

$$\nabla_\theta J(\theta)=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_\theta\log\pi_\theta(\tau)\,r(\tau)\right]$$

This formula computes the expectation of $\nabla_\theta\log\pi_\theta(\tau)\,r(\tau)$ under the same policy $\pi_\theta$ that we are updating, which is a big problem: every time the policy network changes even a little, the old samples have to be discarded and new samples must be collected under the new policy. So on-policy learning can be extremely inefficient.
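To make the sampling requirement concrete, here is a minimal PyTorch-style sketch of the Monte Carlo estimator for a single trajectory. `policy_net` is an assumed discrete-action network (not from the original notes), and the data must come from the current $\pi_\theta$, which is exactly the restriction discussed above.

```python
import torch

def reinforce_loss(policy_net, states, actions, returns):
    """Surrogate loss whose gradient is the on-policy estimator above.
    states: (T, obs_dim), actions: (T,), returns: (T,) filled with the
    total return r(tau) of the trajectory (no causality or baseline yet)."""
    logits = policy_net(states)                                   # hypothetical discrete-action net
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(log_probs * returns).sum()                           # minimizing this ascends J(theta)
```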

Importance sampling & off-policy learning

We need to turn an expectation under the current policy into an expectation under the policy that generated the samples, and importance sampling does exactly that.

$$\begin{aligned} E_{x\sim p(x)}[f(x)] &= \int p(x)f(x)\,dx \\ &= \int \frac{q(x)}{q(x)}p(x)f(x)\,dx \\ &= \int q(x)\frac{p(x)}{q(x)}f(x)\,dx \\ &= E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right] \end{aligned}$$
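As a quick numerical check, here is a small NumPy sketch (the Gaussians $p=\mathcal{N}(1,1)$, $q=\mathcal{N}(0,4)$ and $f(x)=x^2$ are just illustrative choices, not from the notes) that estimates $E_{x\sim p}[f(x)]$ using only samples drawn from $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x) = N(1, 1), proposal q(x) = N(0, 2^2); f(x) = x^2, so E_p[f] = 2.
def p_pdf(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
f = lambda x: x ** 2

x_q = rng.normal(0.0, 2.0, size=100_000)         # samples from q only
is_estimate = np.mean(p_pdf(x_q) / q_pdf(x_q) * f(x_q))

x_p = rng.normal(1.0, 1.0, size=100_000)         # direct Monte Carlo for comparison
print(is_estimate, np.mean(f(x_p)))              # both approximate E_p[x^2] = 2
```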

Now we don't have samples from the current policy $\pi_\theta(\tau)$, but we do have samples from some other policy $\bar{\pi}(\tau)$ instead.

$$J(\theta) = E_{\tau\sim\pi_{\theta}(\tau)}[r(\tau)] = E_{\tau\sim \bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]$$

But we don't know the trajectory distributions themselves. Expanding them shows that the unknown parts cancel:

$$\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}=\frac{\prod_{t=1}^T \pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$$

The initial-state distribution and the transition probabilities cancel, so the two policies are all we need to know.
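In code, this cancellation means the trajectory weight needs nothing but per-step action log-probabilities; a minimal sketch, assuming we have stored $\log\pi(a_t|s_t)$ under both policies:

```python
import numpy as np

def trajectory_is_weight(logp_target, logp_behavior):
    """Importance weight pi_theta(tau) / pi_bar(tau) for one trajectory.
    logp_target / logp_behavior: length-T arrays of log pi(a_t | s_t) under
    the target and behavior policies. The transition terms cancel, so they
    never need to be evaluated."""
    return float(np.exp(np.sum(logp_target) - np.sum(logp_behavior)))
```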

Deriving the policy gradient with IS

Can we estimate the value of some new parameters $\theta'$ using samples collected under the old parameters $\theta$?

$$J(\theta') = E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]$$

Since $\pi_{\theta'}(\tau)$ is the only factor that depends on $\theta'$, we have

$$\nabla_{\theta'} J(\theta') = E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\nabla_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right] = E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right]$$
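The identity on the right is exactly what automatic differentiation produces if we implement the importance-weighted objective directly; a sketch under the same assumed discrete-action setup as the earlier snippet:

```python
import torch

def off_policy_surrogate(new_policy, states, actions, old_log_probs, traj_return):
    """Monte Carlo estimate of -J(theta') from one trajectory collected under theta.
    old_log_probs: per-step log pi_theta(a_t|s_t) stored at collection time
    (treated as constants); traj_return: scalar r(tau). Autodiff of this
    surrogate w.r.t. new_policy's parameters gives the IS gradient above."""
    logits = new_policy(states)                                   # hypothetical discrete-action net
    new_log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    ratio = torch.exp(new_log_probs.sum() - old_log_probs.sum())  # pi_theta'(tau) / pi_theta(tau)
    return -(ratio * traj_return)
```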

The off-policy policy gradient

Locally, when $\theta = \theta'$:

$$\nabla_{\theta'} J(\theta') = E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right] = E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)\,r(\tau)\right] = \nabla_{\theta} J(\theta)$$

It's the same as the on-policy policy gradient.

Globally, when $\theta \ne \theta'$:

$$\begin{aligned} \nabla_{\theta'} J(\theta') &= E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right] \\ &= E_{\tau\sim \pi_\theta(\tau)}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\right)\left(\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)\right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right] \end{aligned}$$

Consider causality

$$\nabla_{\theta'} J(\theta') = E_{\tau\sim \pi_\theta(\tau)}\left[\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)\left(\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'},a_{t'})\left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_{\theta}(a_{t''}|s_{t''})}\right)\right)\right]$$

The first $\prod$ says that future actions don't affect the current importance weight. The second $\sum$ is the reward-to-go. The second $\prod$ is the importance weight from the current step $t$ to the future step $t'$; if we ignore it, we get a policy iteration algorithm.
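A direct (and deliberately naive, $O(T^2)$) NumPy sketch of the scalar coefficient multiplying each $\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)$ in this expression, with the per-step ratios and rewards taken from one stored trajectory:

```python
import numpy as np

def causal_is_coefficients(ratios, rewards):
    """Coefficient on grad log pi_theta'(a_t|s_t) for each t in the causal form above.
    ratios[t]  = pi_theta'(a_t|s_t) / pi_theta(a_t|s_t)
    rewards[t] = r(s_t, a_t); both length-T arrays from one trajectory under pi_theta."""
    T = len(rewards)
    coeffs = np.empty(T)
    for t in range(T):
        past = np.prod(ratios[: t + 1])                        # first product, t' = 1..t
        future = sum(rewards[tp] * np.prod(ratios[t: tp + 1])  # second product, t'' = t..t'
                     for tp in range(t, T))
        coeffs[t] = past * future
    return coeffs
```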

A first-order approximation

The weight $\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}$ is a product of up to $T$ ratios, so it is exponential in $T$ and tends to $0$ or $\infty$, which is a big problem. This problem will be discussed in depth in Advanced Policy Gradient.
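For intuition, the remedy previewed there is to weight each step only by its own single-step ratio rather than by the cumulative product; a hedged sketch of that approximate surrogate, in the same assumed discrete-action setup as the earlier PyTorch snippets:

```python
import torch

def per_step_ratio_surrogate(new_policy, states, actions, old_log_probs, rewards_to_go):
    """Each step is weighted only by its own ratio pi_theta'(a_t|s_t)/pi_theta(a_t|s_t),
    so the weight no longer grows or shrinks exponentially with T. This is the
    approximation the advanced policy gradient material justifies more carefully."""
    logits = new_policy(states)                                # hypothetical discrete-action net
    new_log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    ratios = torch.exp(new_log_probs - old_log_probs)          # one single-step ratio per t
    return -(ratios * rewards_to_go).sum()
```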
