Evaluate the PG

From the last chapter, the goal of RL is

$$
\theta^\star = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] = \arg\max_\theta J(\theta),
$$

where $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$ is called the objective.

Evaluating the objective

The objective can be estimated with Monte Carlo sampling: run the policy in the environment to collect $N$ trajectories and average their total rewards,

$$
J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t r(s_{i,t}, a_{i,t}).
$$
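As a concrete illustration, here is a minimal sketch of this Monte Carlo estimate in Python. The `env` and `policy` arguments are assumptions (a Gymnasium-style environment and any function mapping observations to actions); they are not defined in these notes.

```python
import numpy as np

def estimate_objective(env, policy, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta): average total reward over sampled trajectories.

    `env` is assumed to follow the Gymnasium API (reset/step), and `policy`
    is any function mapping an observation to an action; both are placeholders.
    """
    returns = []
    for _ in range(num_trajectories):
        obs, _ = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        returns.append(total_reward)
    # J(theta) ~ (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})
    return np.mean(returns)
```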

Direct policy gradient

Writing $r(\tau) = \sum_t r(s_t, a_t)$, we can differentiate the objective directly:

$$
\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau) \, r(\tau)\right].
$$

The second equality uses a convenient identity (the log trick):

$$
p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau).
$$

Finally, since $\log p_\theta(\tau) = \log p(s_1) + \sum_t \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$ and only the policy terms depend on $\theta$,

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right].
$$

Evaluating the policy gradient

Replacing the expectation with samples from the current policy gives the estimator

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) \left(\sum_t r(s_{i,t}, a_{i,t})\right).
$$

Naturally, we obtain the following algorithm.

REINFORCE algorithm:

Repeat until convergence:

1. Sample trajectories $\{\tau^i\}$ by running the policy $\pi_\theta(a_t \mid s_t)$ in the environment.
2. Estimate the gradient $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t)\right) \left(\sum_t r(s^i_t, a^i_t)\right)$.
3. Update the parameters $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
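For concreteness, below is a minimal REINFORCE sketch in Python using PyTorch for discrete actions. It is an illustration under assumptions, not a reference implementation: `env` is assumed to follow the Gymnasium API, and the network sizes are arbitrary. Minimizing the negative surrogate loss $-\frac{1}{N}\sum_i \left(\sum_t \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_t r(s_{i,t}, a_{i,t})\right)$ by gradient descent performs exactly the gradient ascent step above.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Categorical policy pi_theta(a|s) with logits from a small MLP (sizes illustrative)."""
    def __init__(self, obs_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, num_actions)
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_step(env, policy, optimizer, num_trajectories=10):
    """One REINFORCE update: sample N trajectories, form the surrogate loss
    -(1/N) sum_i (sum_t log pi(a_t|s_t)) (sum_t r_t), and take a gradient step."""
    losses = []
    for _ in range(num_trajectories):
        obs, _ = env.reset()          # Gymnasium-style API assumed
        log_probs, rewards = [], []
        done = False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # (sum_t grad log pi) * (sum_t r): the total return weights the summed log-probs
        losses.append(-torch.stack(log_probs).sum() * sum(rewards))
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()                   # autograd supplies grad_theta log pi
    optimizer.step()                  # theta <- theta + alpha * grad J (via descent on -J)
    return loss.item()
```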

Continuous actions: Gaussian policies

For continuous actions, we can use a Gaussian policy whose mean is the output of a neural network,

$$
\pi_\theta(a_t \mid s_t) = \mathcal{N}\left(f_\theta(s_t), \Sigma\right),
$$

so that

$$
\log \pi_\theta(a_t \mid s_t) = -\frac{1}{2} \left(f_\theta(s_t) - a_t\right)^\top \Sigma^{-1} \left(f_\theta(s_t) - a_t\right) + \text{const}, \qquad
\nabla_\theta \log \pi_\theta(a_t \mid s_t) = -\left(f_\theta(s_t) - a_t\right)^\top \Sigma^{-1} \frac{df_\theta}{d\theta}.
$$

This score function $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is exactly what gets plugged into the policy gradient estimator above.
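A minimal sketch of such a Gaussian policy in PyTorch is shown below, assuming the mean comes from a small network and the diagonal covariance is a learned, state-independent parameter; the dimensions and architecture are illustrative, and autograd computes the score function rather than the hand-derived formula.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Gaussian policy pi_theta(a|s) = N(f_theta(s), diag(exp(log_std))^2).
    A sketch: mean from a small MLP, learned state-independent log standard deviations."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Compute log pi_theta(a|s) and its gradient via autograd; values are illustrative.
policy = GaussianPolicy(obs_dim=3, act_dim=2)
obs = torch.randn(3)
dist = policy(obs)
action = dist.sample()
log_prob = dist.log_prob(action).sum()  # sum over action dimensions (diagonal Gaussian)
log_prob.backward()                     # populates grad_theta log pi_theta(a|s)
```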

Partial observability

Notice that the Markov property is not actually used in this derivation, so we can apply the policy gradient to partially observed MDPs without modification: simply condition the policy on observations $o_t$ instead of states $s_t$.
