Evaluating the Policy Gradient
From the last chapter, the goal of RL is

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\right],$$

where the expectation

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\right]$$

is called the objective.
Evaluating the objective

Since the objective is an expectation over trajectories, we can estimate it with samples: run the policy $N$ times and average the returns,

$$J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_t r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}).$$
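As a sanity check, the Monte Carlo estimate above can be sketched on a toy two-state, two-action MDP (the environment, its reward, and its transitions here are hypothetical illustrations, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy, T=10):
    """Sample one trajectory from a toy 2-state, 2-action MDP; return its total reward."""
    s = 0
    total = 0.0
    for _ in range(T):
        a = rng.choice(2, p=policy[s])      # a_t ~ pi_theta(a | s)
        total += 1.0 if a == s else 0.0     # toy reward r(s_t, a_t): match the state
        s = rng.choice(2)                   # toy (action-independent) transition
    return total

def estimate_J(policy, N=1000):
    """Monte Carlo estimate: J(theta) ~= (1/N) sum_i sum_t r(s_it, a_it)."""
    return float(np.mean([rollout(policy) for _ in range(N)]))

policy = np.array([[0.9, 0.1],              # pi_theta(a | s); rows are states
                   [0.1, 0.9]])
J = estimate_J(policy)
```

With this policy the per-step expected reward is 0.9, so over 10 steps the estimate should land near 9.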
Direct policy gradient

Writing $J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau$ with $r(\tau) = \sum_t r(\mathbf{s}_t, \mathbf{a}_t)$, we differentiate under the integral:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau.$$

The second equation uses a convenient identity (the log trick): $p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$.

Finally, since $\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_t \big[\log \pi_\theta(\mathbf{a}_t|\mathbf{s}_t) + \log p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)\big]$ and the initial-state and transition terms do not depend on $\theta$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\right)\left(\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\right)\right].$$
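The log trick can be verified numerically on a one-dimensional example (the Gaussian family and the test function $f(x) = x^2$ are illustrative choices, not from the lecture): for $x \sim \mathcal{N}(\theta, 1)$, $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$, and the score-function estimator should agree.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=2_000_000)   # x ~ p_theta = N(theta, 1)

# f(x) = x^2, so E[f] = theta^2 + 1 and the true gradient d/dtheta E[f] = 2*theta.
true_grad = 2 * theta

# Score-function (log trick) estimator: E[f(x) * d/dtheta log p_theta(x)],
# where d/dtheta log N(x; theta, 1) = (x - theta).
score_grad = float(np.mean(x**2 * (x - theta)))
```

The two quantities match up to Monte Carlo error, which is the whole point of the identity: it turns a gradient of an expectation into an expectation we can sample.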
Evaluating the policy gradient

As with the objective, the expectation can be estimated from sampled trajectories:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\right)\left(\sum_t r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})\right).$$
Naturally, we obtain the following algorithm.

REINFORCE algorithm — repeat until convergence:

1. Sample trajectories $\{\tau^i\}$ by running the policy $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$.
2. Estimate the gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}^i_t|\mathbf{s}^i_t)\right)\left(\sum_t r(\mathbf{s}^i_t, \mathbf{a}^i_t)\right)$.
3. Take a gradient ascent step: $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$.
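The three steps above can be sketched end to end with a tabular softmax policy on a hypothetical two-state MDP (the environment, hyperparameters, and reward are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 10

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout(theta):
    """Run the current softmax policy in a toy MDP; return states, actions, total reward."""
    states, actions, R = [], [], 0.0
    s = 0
    for _ in range(T):
        a = rng.choice(n_actions, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        R += 1.0 if a == s else 0.0          # toy reward: match the state
        s = rng.choice(n_states)             # toy (action-independent) transition
    return states, actions, R

def reinforce(iters=200, N=20, alpha=0.1):
    theta = np.zeros((n_states, n_actions))  # policy parameters, one row per state
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for _ in range(N):                   # 1. sample trajectories
            states, actions, R = rollout(theta)
            for s, a in zip(states, actions):
                g = -softmax(theta[s])       # 2. grad of log softmax ...
                g[a] += 1.0
                grad[s] += g * R             #    ... weighted by the return
        theta += alpha * grad / N            # 3. gradient ascent step
    return theta

theta = reinforce()
```

After training, the policy should put most of its probability on the rewarded (state-matching) action in both states.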
Continuous actions: Gaussian policies

For continuous actions, we can use a Gaussian policy, e.g.

$$\pi_\theta(\mathbf{a}_t|\mathbf{s}_t) = \mathcal{N}\big(f(\mathbf{s}_t),\, \Sigma\big),$$

where the mean $f(\mathbf{s}_t)$ is, for example, a neural network. Then

$$\log \pi_\theta(\mathbf{a}_t|\mathbf{s}_t) = -\frac{1}{2}\big(f(\mathbf{s}_t) - \mathbf{a}_t\big)^\top \Sigma^{-1} \big(f(\mathbf{s}_t) - \mathbf{a}_t\big) + \text{const},$$

and $\nabla_\theta \log \pi_\theta(\mathbf{a}_t|\mathbf{s}_t) = -\Sigma^{-1}\big(f(\mathbf{s}_t) - \mathbf{a}_t\big)\frac{df}{d\theta}$, which can be computed by backpropagation through $f$.
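The Gaussian log-probability gradient can be checked against finite differences for the simplest case, a hypothetical linear mean $f(\mathbf{s}) = W\mathbf{s}$ with $\Sigma = \sigma^2 I$ (this setup is an illustrative assumption, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
W = rng.normal(size=(2, 3))                 # hypothetical linear mean: f(s) = W @ s
s = rng.normal(size=3)
a = rng.normal(size=2)                      # a sampled action

def log_prob(W):
    """log N(a; W s, sigma^2 I), dropping the theta-independent constant."""
    diff = a - W @ s
    return -0.5 * np.dot(diff, diff) / sigma**2

# Analytic gradient: d/dW log pi = (1/sigma^2) * (a - W s) s^T
grad = np.outer(a - W @ s, s) / sigma**2

# Finite-difference check of the analytic formula, entry by entry
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy()
        Wp[i, j] += eps
        num[i, j] = (log_prob(Wp) - log_prob(W)) / eps

err = float(np.max(np.abs(grad - num)))
```

For a nonlinear $f$, the same quantity is what autodiff computes when backpropagating the log-density through the network.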
Partial observability

Notice that the Markov property is never actually used in the derivation above, so we can apply the policy gradient to partially observed MDPs without modification, simply conditioning the policy on observations $\mathbf{o}_t$ instead of states $\mathbf{s}_t$.