PG in practice

Policy gradient with automatic differentiation: ignored.

Gradient has high variance and will be noisy. Because this isn't the same as supervised learning.
Consider using much larger batches.
Tweaking learning rates $\alpha$ is very hard, and adaptive step size rules like ADAM can be OK-ish.

PreviousOff-policy version PG NextActor-Critic Algorithms

Last updated 6 years ago