Evaluate the PG
From the last chapter, the goal of RL is

$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right],$$

and we denote

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right],$$

which is called the objective.
We can use the Monte Carlo method to get samples and estimate the objective:

$$J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t r(s_{i,t}, a_{i,t}).$$

Every sample (each $i$) is a trajectory collected over time $t$ by running $\pi_\theta$.
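As a minimal sketch of this estimate (the `sample_trajectory` helper and the toy random returns are assumptions for illustration, not part of the notes), the Monte Carlo estimate of $J(\theta)$ is just an average of sampled returns:

```python
import numpy as np

def estimate_objective(sample_trajectory, num_trajectories=100):
    """Monte Carlo estimate of J(theta) = E[sum_t r(s_t, a_t)].

    `sample_trajectory` is any callable that runs the current policy
    pi_theta once and returns the list of rewards it collected.
    """
    returns = [np.sum(sample_trajectory()) for _ in range(num_trajectories)]
    return float(np.mean(returns))

# Toy stand-in for "run the policy in the environment": the per-step rewards
# are random here, purely to make the sketch self-contained and runnable.
rng = np.random.default_rng(0)
print(estimate_objective(lambda: rng.normal(size=10)))
```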
The gradient of $J(\theta)$ can be turned into an expectation by a convenient identity (the log trick):

$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau).$$

Finally (the full derivation is given below), the policy gradient and its Monte Carlo estimate are

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_t r(s_t, a_t)\right)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_t r(s_{i,t}, a_{i,t})\right).$$
Naturally, we obtain the following algorithm.

REINFORCE algorithm: repeat until convergence:

1. sample trajectories $\{\tau^i\}$ by running the current policy $\pi_\theta(a_t \mid s_t)$;
2. estimate $\nabla_\theta J(\theta)$ with the sample average above;
3. take a gradient ascent step $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$.
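Below is a minimal sketch of this loop in PyTorch. The `ToyEnv`, network architecture, and hyperparameters are illustrative assumptions; step 2 is implemented by taking the autodiff gradient of the surrogate loss $-\frac{1}{N}\sum_i \big(\sum_t \log \pi_\theta(a^i_t \mid s^i_t)\big)\big(\sum_t r(s^i_t, a^i_t)\big)$, whose gradient equals $-\nabla_\theta J(\theta)$ because the rewards do not depend on $\theta$ through the computation graph.

```python
# Minimal REINFORCE sketch (PyTorch). ToyEnv is a stand-in environment used
# only to keep the example self-contained; any environment with the same
# reset()/step() interface can be plugged in.
import torch
import torch.nn as nn

class ToyEnv:
    """Stateless toy problem with 2 actions: action 1 gives reward 1,
    episodes last 10 steps, the observation is a constant 2-dim vector."""
    def reset(self):
        self.t = 0
        return torch.zeros(2)
    def step(self, action):
        self.t += 1
        reward = float(action == 1)
        done = self.t >= 10
        return torch.zeros(2), reward, done

policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
env = ToyEnv()

for iteration in range(200):          # "repeat until convergence"
    log_probs, returns = [], []
    for _ in range(8):                # 1. sample N trajectories from pi_theta
        obs, done, ep_logp, ep_return = env.reset(), False, 0.0, 0.0
        while not done:
            dist = torch.distributions.Categorical(logits=policy(obs))
            action = dist.sample()
            obs, reward, done = env.step(action.item())
            ep_logp = ep_logp + dist.log_prob(action)   # sum_t log pi(a_t|s_t)
            ep_return += reward                          # sum_t r(s_t, a_t)
        log_probs.append(ep_logp)
        returns.append(ep_return)
    # 2. surrogate loss whose gradient is -grad J(theta) (score-function estimator)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).mean()
    # 3. gradient ascent on J(theta) == gradient descent on the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```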
For continuous actions, we can use a Gaussian policy, e.g.

$$\pi_\theta(a_t \mid s_t) = \mathcal{N}\big(f_\theta(s_t),\, \Sigma\big), \qquad \log \pi_\theta(a_t \mid s_t) = -\frac{1}{2}\big(f_\theta(s_t) - a_t\big)^{\top} \Sigma^{-1} \big(f_\theta(s_t) - a_t\big) + \text{const},$$

where $f_\theta$ is, for example, a neural network that outputs the mean, so $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is straightforward to compute.
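A common implementation is a network that outputs the mean together with a learned, state-independent log standard deviation; here is a minimal PyTorch sketch (the dimensions and layer sizes are arbitrary choices for illustration):

```python
# Sketch of a Gaussian policy head for continuous actions; a diagonal
# covariance is assumed for simplicity.
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2
mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))   # learned, state-independent std

def log_prob(obs, action):
    """log pi_theta(a | s) for a diagonal Gaussian N(f_theta(s), diag(sigma^2))."""
    dist = torch.distributions.Normal(mean_net(obs), log_std.exp())
    return dist.log_prob(action).sum(dim=-1)   # sum over action dimensions

obs = torch.randn(3, obs_dim)       # batch of 3 states
action = torch.randn(3, act_dim)    # batch of 3 actions
print(log_prob(obs, action))        # differentiable w.r.t. mean_net and log_std
```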
Notice that the Markov property is not actually used in this derivation, so we can use the policy gradient in partially observed MDPs without modification.
Now let us go through the derivation of the policy gradient in more detail. We need to improve the objective $J(\theta)$, so we can simply take the derivative of $J(\theta)$ and then use gradient ascent. Denote

$$r(\tau) = \sum_t r(s_t, a_t), \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big] = \int p_\theta(\tau)\, r(\tau)\, \mathrm{d}\tau.$$

So the derivative of $J(\theta)$ with respect to $\theta$ is

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, \mathrm{d}\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, \mathrm{d}\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big].$$
From the above, $J(\theta)$ is the expectation of $r(\tau)$ and $\nabla_\theta J(\theta)$ is the expectation of $\nabla_\theta \log p_\theta(\tau)$ (with weight $r(\tau)$), where $\tau$ follows $p_\theta(\tau)$.
We already know $\pi_\theta(a_t \mid s_t)$, so what is $\nabla_\theta \log p_\theta(\tau)$? Since

$$p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),$$

we have

$$\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \Big[\log p(s_1) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log p(s_{t+1} \mid s_t, a_t)\Big] = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

The last equality holds because the initial-state and transition terms have nothing to do with $\theta$. It is also worth mentioning that the log trick turns $\nabla_\theta p_\theta(\tau)$ into $p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, which is an expectation under $p_\theta(\tau)$ and therefore easy to estimate from samples.
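As a quick numerical sanity check of this claim (a toy example added for illustration, not from the original notes): take a one-parameter Gaussian "policy" $a \sim \mathcal{N}(\theta, \sigma^2)$ with reward $r(a) = -(a-3)^2$, for which the true gradient $\nabla_\theta J(\theta) = -2(\theta - 3)$ is known in closed form, and compare it to the score-function estimate.

```python
import numpy as np

# Toy check of the score-function (log-trick) estimator for a one-parameter
# Gaussian "policy" a ~ N(theta, sigma^2) with reward r(a) = -(a - 3)^2.
rng = np.random.default_rng(0)
theta, sigma, n_samples = 0.5, 1.0, 200_000

a = rng.normal(theta, sigma, size=n_samples)    # sample actions from the policy
reward = -(a - 3.0) ** 2                        # observed rewards
score = (a - theta) / sigma**2                  # grad_theta log p(a; theta)

grad_estimate = np.mean(score * reward)         # E[grad log p * r]
grad_exact = -2.0 * (theta - 3.0)               # closed form: J = -((theta-3)^2 + sigma^2)

print(grad_estimate, grad_exact)                # the two values should be close
```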
Notice that this estimator needs neither the transition probabilities nor the initial state distribution; we can simply sample trajectories from the environment without knowing the dynamics of the system. Besides, the reward $r(\tau)$ can be customized.
In practice, we use $N$ trajectory samples and take the average:

1. sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$ (run the current policy);
2. $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_t \nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t)\right)\left(\sum_t r(s^i_t, a^i_t)\right)$;
3. $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$,

which is exactly the REINFORCE update summarized above.
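For completeness, here is a sketch of how step 2 can be computed by hand (without autodiff) for a linear-softmax policy over discrete actions; the policy parameterization and the logged trajectories below are illustrative assumptions:

```python
import numpy as np

# Score-function gradient estimate, computed analytically for a linear-softmax
# policy pi_theta(a|s) = softmax(theta^T phi(s))[a]. The trajectories below are
# made-up placeholder data; in practice they come from running pi_theta.
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, phi, a):
    """grad_theta log pi_theta(a|s) for the linear-softmax policy.

    theta has shape (num_features, num_actions); phi is the feature vector of s.
    The gradient is the outer product phi * (one_hot(a) - pi(.|s)).
    """
    pi = softmax(phi @ theta)
    indicator = np.zeros(theta.shape[1])
    indicator[a] = 1.0
    return np.outer(phi, indicator - pi)

def policy_gradient_estimate(theta, trajectories):
    """(1/N) sum_i (sum_t grad log pi(a_t|s_t)) (sum_t r_t)."""
    grads = []
    for traj in trajectories:                      # traj = list of (phi(s_t), a_t, r_t)
        g = sum(grad_log_pi(theta, phi, a) for phi, a, r in traj)
        R = sum(r for _, _, r in traj)
        grads.append(g * R)
    return np.mean(grads, axis=0)

# Tiny made-up example: 2 features, 3 actions, two short trajectories.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))
trajectories = [
    [(rng.normal(size=2), 0, 1.0), (rng.normal(size=2), 2, 0.5)],
    [(rng.normal(size=2), 1, -0.5)],
]
grad = policy_gradient_estimate(theta, trajectories)
theta += 0.01 * grad                               # step 3: gradient ascent
```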