Policy gradient is on-policy
$$\nabla_\theta J(\theta)=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_\theta\log\pi_\theta(\tau)\,r(\tau)\right]$$

This formula takes the expectation of $\nabla_\theta\log\pi_\theta(\tau)\,r(\tau)$ under the same policy $\pi_\theta$ that we are updating, which is a big problem: every time the policy network changes even a little, the old samples have to be discarded and new samples must be collected under the new policy. On-policy learning can therefore be extremely inefficient.
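For concreteness, here is a minimal sketch of the corresponding on-policy estimator, assuming a hypothetical discrete-action PyTorch policy `policy_net` that maps states to action logits; the function name and data layout are illustrative, not a reference implementation.

```python
# Minimal on-policy REINFORCE sketch (illustrative; assumes a discrete-action
# policy network `policy_net` mapping states to action logits).
import torch
from torch.distributions import Categorical

def reinforce_loss(policy_net, trajectories):
    """Surrogate loss whose gradient is -E_tau[ sum_t grad log pi_theta(a_t|s_t) * r(tau) ].

    Each trajectory is a dict with:
      'states'  : FloatTensor [T, state_dim]
      'actions' : LongTensor  [T]
      'rewards' : FloatTensor [T]
    and must be sampled from the *current* policy pi_theta.
    """
    losses = []
    for traj in trajectories:
        logits = policy_net(traj['states'])                                # [T, n_actions]
        log_probs = Categorical(logits=logits).log_prob(traj['actions'])   # [T]
        traj_return = traj['rewards'].sum()                                # r(tau), a constant w.r.t. theta
        losses.append(-(log_probs.sum() * traj_return))
    return torch.stack(losses).mean()
```

As soon as the parameters are updated, these trajectories no longer come from the current policy, so they cannot be reused for the next gradient step.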
Importance sampling & off-policy learning
We need to rewrite the expectation under the current policy as an expectation under some other (old) policy, and importance sampling does exactly that.
$$\begin{aligned}
E_{x\sim p(x)}[f(x)]
&= \int p(x)f(x)\,dx \\
&= \int \frac{q(x)}{q(x)}p(x)f(x)\,dx \\
&= \int q(x)\frac{p(x)}{q(x)}f(x)\,dx \\
&= E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]
\end{aligned}$$
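A quick numerical sanity check of this identity (a toy sketch: $p=\mathcal{N}(1,1)$, $q=\mathcal{N}(0,1)$ and $f(x)=x^2$ are chosen purely for illustration).

```python
# Toy importance-sampling check: estimate E_{x~p}[f(x)] using samples drawn
# from a different distribution q, reweighted by p(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
mu_p, mu_q = 1.0, 0.0                          # p = N(1, 1), q = N(0, 1)

x_p = rng.normal(mu_p, 1.0, size=100_000)      # samples from p (direct estimate)
x_q = rng.normal(mu_q, 1.0, size=100_000)      # samples from q (what we actually have)

direct = np.mean(f(x_p))                                        # E_{x~p}[f(x)]
w = normal_pdf(x_q, mu_p, 1.0) / normal_pdf(x_q, mu_q, 1.0)     # importance weights p(x)/q(x)
reweighted = np.mean(w * f(x_q))                                # E_{x~q}[(p/q) f(x)]
print(direct, reweighted)                      # both approach E_p[x^2] = 1^2 + 1 = 2
```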
Now suppose we don't have samples from the current policy $\pi_\theta(\tau)$, but we do have samples from some other policy $\bar{\pi}(\tau)$. Then

$$J(\theta)
=E_{\tau\sim\pi_{\theta}(\tau)}[r(\tau)]
=E_{\tau\sim \bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]$$

But we can't evaluate the trajectory distributions $\pi_\theta(\tau)$ and $\bar{\pi}(\tau)$ themselves, because they contain the unknown transition probabilities:
$$\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}
=\frac{p(s_1)\prod_{t=1}^T \pi_\theta (a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}
=\frac{\prod_{t=1}^T \pi_\theta (a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$$

The transition probabilities cancel, so we never need to know the dynamics: the two policies are all we have to evaluate.
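In practice this ratio is computed in log space from the stored per-step action log-probabilities, so the dynamics never appear. A minimal sketch (array names are illustrative):

```python
# Trajectory importance weight pi_theta(tau) / pi_bar(tau), computed from
# per-step action log-probabilities only (the dynamics terms have cancelled).
import numpy as np

def trajectory_weight(logp_theta, logp_bar):
    """logp_theta, logp_bar: arrays of shape [T] holding log pi(a_t|s_t)
    under the two policies for one stored trajectory."""
    # prod_t pi_theta(a_t|s_t)/pi_bar(a_t|s_t) = exp( sum_t (log pi_theta - log pi_bar) )
    return np.exp(np.sum(logp_theta - logp_bar))
```

Working in log space avoids explicitly multiplying many small probabilities together.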
Deriving the policy gradient with IS
Can we estimate the objective for some new parameters $\theta'$ using samples collected under the old parameters $\theta$?
$$J(\theta')
=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]$$

Since $\pi_{\theta'}(\tau)$ is the only bit that depends on $\theta'$, we have
$$\nabla_{\theta'} J(\theta')
=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\nabla_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]
=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right]$$

The off-policy policy gradient
Locally, $\theta=\theta'$
$$\begin{aligned}
\nabla_{\theta'} J(\theta')
&=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right] \\
&=E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)\,r(\tau)\right] \\
&=\nabla_{\theta} J(\theta)
\end{aligned}$$

When $\theta'=\theta$ the importance weight is exactly $1$, so this is the same as the on-policy policy gradient.
Globally, $\theta\ne\theta'$
$$\begin{aligned}
\nabla_{\theta'} J(\theta')
&=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)\,r(\tau)\right] \\
&=E_{\tau\sim \pi_\theta(\tau)}
\left[
\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\right)
\left(\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)\right)
\left(\sum_{t=1}^T r(s_t,a_t)\right)
\right]
\end{aligned}$$
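Here is a sketch of a surrogate loss whose gradient matches this expression, reusing the hypothetical `policy_net` interface from above and assuming each stored trajectory keeps the per-step log-probabilities recorded under the old policy $\pi_\theta$. The trajectory ratio is detached from the computation graph, so differentiating the surrogate with respect to $\theta'$ reproduces the trajectory ratio times $\sum_t\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)$ times the return.

```python
# Off-policy policy-gradient surrogate for the full-trajectory formula above.
# Illustrative sketch: `old_log_probs` were recorded when tau was sampled from
# pi_theta, and `policy_net` now holds the new parameters theta'.
import torch
from torch.distributions import Categorical

def off_policy_pg_loss(policy_net, trajectories):
    losses = []
    for traj in trajectories:
        logits = policy_net(traj['states'])                                   # [T, n_actions]
        new_log_probs = Categorical(logits=logits).log_prob(traj['actions'])  # log pi_theta'(a_t|s_t)
        # prod_t pi_theta'(a_t|s_t) / pi_theta(a_t|s_t), held constant w.r.t. theta'
        ratio = torch.exp((new_log_probs - traj['old_log_probs']).sum()).detach()
        traj_return = traj['rewards'].sum()
        # gradient of this term = ratio * sum_t grad log pi_theta'(a_t|s_t) * r(tau)
        losses.append(-(ratio * new_log_probs.sum() * traj_return))
    return torch.stack(losses).mean()
```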
Consider causality:

$$\begin{aligned}
\nabla_{\theta'} J(\theta')
&=E_{\tau\sim \pi_\theta(\tau)}
\left[
\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)
\left(\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}\right)
\left(\sum_{t'=t}^T r(s_{t'},a_{t'}) \left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_{\theta}(a_{t''}|s_{t''})}\right)\right)
\right]
\end{aligned}$$

The first product says that future actions don't affect the importance weight at the current timestep. The second sum is the reward to go. The second product accounts for the probability of getting from the current timestep $t$ to the future timestep $t'$; if we ignore it, we get a policy iteration algorithm.
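To make the indexing concrete, here is an illustrative sketch (written for clarity rather than efficiency) of the per-timestep coefficient that multiplies $\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)$ in this estimator, given the per-step ratios and rewards of one trajectory; the names are assumptions, not part of any library.

```python
# Per-timestep coefficients for the causal, off-policy gradient estimator above.
# `ratios[t]` holds pi_theta'(a_t|s_t) / pi_theta(a_t|s_t) for one trajectory.
import numpy as np

def causal_coefficients(ratios, rewards):
    """coef[t] = (prod_{t'<=t} ratio_{t'}) * sum_{t'>=t} r_{t'} * prod_{t''=t..t'} ratio_{t''},
    i.e. the factor that multiplies grad log pi_theta'(a_t|s_t)."""
    T = len(rewards)
    coef = np.zeros(T)
    for t in range(T):
        past_weight = np.prod(ratios[: t + 1])        # first product: weights up to time t
        future = 0.0
        for tp in range(t, T):
            # reward-to-go, reweighted by the probability of reaching t' from t;
            # dropping np.prod(ratios[t:tp + 1]) gives the simpler policy-iteration-style variant
            future += rewards[tp] * np.prod(ratios[t : tp + 1])
        coef[t] = past_weight * future
    return coef
```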
A first-order approximation
$\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}$ is exponential in $T$, and it will tend to $0$ or $\infty$ as the horizon grows, which is a big problem for the variance of the estimator. A first-order approximation keeps only the current timestep's ratio $\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}$ instead of the full product; this problem, and why the approximation is reasonable, will be discussed in depth in Advanced Policy Gradient.
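A toy numerical illustration of the degeneration: the per-step ratios below are simulated as log-normal variables with mean $1$, standing in for $\pi_{\theta'}(a_t|s_t)/\pi_\theta(a_t|s_t)$.

```python
# Why the product of per-step ratios degenerates: even if each ratio averages 1,
# the product over T steps is exponential in T, so typical trajectory weights
# collapse toward 0 while rare ones explode.
import numpy as np

rng = np.random.default_rng(0)
for T in (10, 100, 1000):
    # 5 simulated trajectories of per-step log-ratios with E[ratio] = 1
    log_ratios = rng.normal(loc=-0.005, scale=0.1, size=(5, T))
    products = np.exp(log_ratios.sum(axis=1))    # full-trajectory importance weights
    print(T, products)
```

As $T$ grows the weights concentrate on a handful of trajectories, so the variance of the estimator blows up.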