Off-policy version of policy gradient
The on-policy policy gradient

$$\nabla_\theta J(\theta) = E_{\tau\sim\pi_\theta(\tau)}\big[\nabla_\theta\log\pi_\theta(\tau)\, r(\tau)\big]$$

takes an expectation under the very policy $\pi_\theta$ that is being optimized, and this is a big problem: every time the policy network changes even a little, the old samples must be discarded and new samples must be collected under the new policy. On-policy learning can therefore be extremely inefficient.
We need to turn the expectation under the new policy into an expectation under the old policy, and this can be done with importance sampling:

$$E_{x\sim p(x)}[f(x)] = \int p(x)\,f(x)\,dx = \int q(x)\,\frac{p(x)}{q(x)}\,f(x)\,dx = E_{x\sim q(x)}\!\left[\frac{p(x)}{q(x)}\,f(x)\right]$$
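As a quick sanity check, here is a minimal numpy sketch of the identity above, using hypothetical Gaussian densities $p$ and $q$ (an illustration, not part of the original notes): sampling from $q$ and reweighting by $p/q$ recovers the same expectation as sampling from $p$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):   # target density p(x) = N(1, 1) -- hypothetical choice
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):   # sampling density q(x) = N(0, 2^2) -- hypothetical choice
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

f = lambda x: x ** 2                        # any function of interest

x_p = rng.normal(1.0, 1.0, size=100_000)    # samples from p (for comparison)
x_q = rng.normal(0.0, 2.0, size=100_000)    # samples from q only

direct   = np.mean(f(x_p))                               # E_{x~p}[f(x)]
weighted = np.mean(p_pdf(x_q) / q_pdf(x_q) * f(x_q))     # E_{x~q}[(p/q) f(x)]
print(direct, weighted)                     # both approach E_p[x^2] = 2
```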
Now suppose we do not have samples from the current policy $\pi_\theta(\tau)$, but instead have samples from some other distribution $\bar\pi(\tau)$. Importance sampling turns the objective into

$$J(\theta) = E_{\tau\sim\bar\pi(\tau)}\!\left[\frac{\pi_\theta(\tau)}{\bar\pi(\tau)}\, r(\tau)\right]$$
We do not have to know the transition probabilities: writing out the two trajectory distributions, the initial-state and transition terms cancel, and the two policies are all we need to know:

$$\frac{\pi_\theta(\tau)}{\bar\pi(\tau)} = \frac{p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t)}{p(s_1)\prod_{t=1}^{T}\bar\pi(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t)} = \frac{\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)}{\prod_{t=1}^{T}\bar\pi(a_t\mid s_t)}$$
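In code, the cancellation means the trajectory weight is just a product of per-step action-probability ratios, which is best accumulated in log space. A minimal sketch, assuming we already have arrays of per-step log-probabilities under the two policies (hypothetical inputs, not from the notes):

```python
import numpy as np

def trajectory_is_weight(logp_theta, logp_bar):
    """logp_theta[t] = log pi_theta(a_t|s_t), logp_bar[t] = log pi_bar(a_t|s_t).
    Returns prod_t pi_theta(a_t|s_t) / pi_bar(a_t|s_t), computed in log space."""
    return np.exp(np.sum(logp_theta) - np.sum(logp_bar))

def off_policy_objective_estimate(logps_theta, logps_bar, returns):
    """Monte Carlo estimate of J(theta) = E_{tau~pi_bar}[w(tau) r(tau)]
    from N trajectories collected under pi_bar."""
    weights = np.array([trajectory_is_weight(lt, lb)
                        for lt, lb in zip(logps_theta, logps_bar)])
    return np.mean(weights * np.asarray(returns))
```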
Taking the gradient of this reweighted objective works the same way as for the on-policy policy gradient; the derivation is unchanged, only the importance weight is carried along:

$$\nabla_\theta J(\theta) = E_{\tau\sim\bar\pi(\tau)}\!\left[\frac{\pi_\theta(\tau)}{\bar\pi(\tau)}\,\nabla_\theta\log\pi_\theta(\tau)\, r(\tau)\right]$$

Causality can also be applied here, just as in the on-policy case, so that each action is weighted only by the rewards that follow it; the expanded expression is given below.
But in practice we want to evaluate an updated policy while we only have data from the old one; we do not know the trajectory distribution of the new parameters. Can we estimate the value of some new parameters $\theta'$ using samples collected under the old parameters $\theta$? Importance sampling says yes:

$$J(\theta') = E_{\tau\sim\pi_\theta(\tau)}\!\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau)\right]$$
Since $\pi_{\theta'}(\tau)$ is the only part that depends on $\theta'$, we have

$$\nabla_{\theta'}J(\theta') = E_{\tau\sim\pi_\theta(\tau)}\!\left[\frac{\nabla_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau)\right] = E_{\tau\sim\pi_\theta(\tau)}\!\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\,\nabla_{\theta'}\log\pi_{\theta'}(\tau)\, r(\tau)\right]$$
Locally, i.e. when the gradient is evaluated at $\theta' = \theta$, the importance weight equals 1 and we recover exactly the on-policy policy gradient:

$$\nabla_\theta J(\theta) = E_{\tau\sim\pi_\theta(\tau)}\big[\nabla_\theta\log\pi_\theta(\tau)\, r(\tau)\big]$$
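A minimal PyTorch sketch of this surrogate objective, assuming we already have per-trajectory summed log-probabilities and returns (all names here are hypothetical): autograd differentiates through the ratio, and at $\theta' = \theta$ the ratio is 1, so the gradient coincides with the on-policy one.

```python
import torch

def surrogate(logp_new, logp_old, returns):
    """logp_new[i] = sum_t log pi_theta'(a_t|s_t) for trajectory i (needs grad),
    logp_old[i] = the same sum under the data-collecting policy pi_theta,
    returns[i]  = r(tau_i)."""
    ratio = torch.exp(logp_new - logp_old.detach())   # pi_theta'(tau) / pi_theta(tau)
    return (ratio * returns).mean()                   # estimate of J(theta')

# At theta' = theta the ratio is exactly 1, and the gradient matches the
# plain on-policy estimator mean(logp * returns) differentiated w.r.t. logp.
logp    = torch.tensor([-3.1, -2.7, -4.0], requires_grad=True)
returns = torch.tensor([1.0, 0.5, 2.0])
surrogate(logp, logp.clone(), returns).backward()
print(logp.grad)        # equals returns / 3, same as for the on-policy objective
```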
Globally, expanding the trajectory ratio into per-step ratios and applying causality gives

$$\nabla_{\theta'}J(\theta') = E_{\tau\sim\pi_\theta(\tau)}\!\left[\sum_{t=1}^{T}\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\left(\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}\mid s_{t'})}{\pi_\theta(a_{t'}\mid s_{t'})}\right)\left(\sum_{t'=t}^{T}r(s_{t'},a_{t'})\left(\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}\mid s_{t''})}{\pi_\theta(a_{t''}\mid s_{t''})}\right)\right)\right]$$
The first product runs only up to time $t$, which means future actions do not affect the current weight. The second sum is the reward to go; its inner product accounts for the probability, under the new policy relative to the old one, of getting from the current step $t$ to the future steps $t'$, and if we ignore this inner product, we get a policy iteration algorithm.
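The sketch below (numpy, hypothetical per-step inputs) computes, for one trajectory, the scalar coefficient that multiplies each $\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)$ in the expression above: the weight of the past actions up to $t$, times the reward to go reweighted from $t$ forward.

```python
import numpy as np

def per_step_coefficients(logp_new, logp_old, rewards):
    """Coefficients c[t] such that the per-trajectory gradient estimate is
    sum_t c[t] * grad_theta' log pi_theta'(a_t|s_t)."""
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old)
    T = len(rewards)
    coeffs = np.zeros(T)
    for t in range(T):
        # prod_{t'=1..t} ratio: future actions do not enter this weight
        past_weight = np.exp(log_ratio[: t + 1].sum())
        # reward to go, each future reward reweighted from step t to t'
        # (dropping this inner reweighting gives the policy iteration variant)
        future = sum(rewards[tp] * np.exp(log_ratio[t : tp + 1].sum())
                     for tp in range(t, T))
        coeffs[t] = past_weight * future
    return coeffs
```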
The importance weight $\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}\mid s_{t'})}{\pi_\theta(a_{t'}\mid s_{t'})}$ is a product that is exponential in $T$: it tends toward 0 or blows up toward infinity, so the variance of the estimator becomes huge, which is a big problem. This problem is discussed in depth in Advanced Policy Gradient.
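A small numeric illustration of this degeneracy (with hypothetical per-step ratios drawn so that each has mean 1): as $T$ grows, most trajectory weights collapse toward 0 while a few explode, and the effective sample size of the estimator shrinks rapidly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 10_000, 0.2                      # trajectories, per-step log-ratio spread
for T in (10, 50, 200):
    # per-step ratios exp(N(-sigma^2/2, sigma^2)) have mean exactly 1
    log_ratios = rng.normal(-0.5 * sigma**2, sigma, size=(N, T))
    w = np.exp(log_ratios.sum(axis=1))      # trajectory weights, mean 1 in expectation
    ess = w.sum() ** 2 / (w ** 2).sum()     # effective sample size out of N
    print(f"T={T:3d}  mean weight={w.mean():.3f}  effective samples={ess:.1f}")
```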