Value functions and Q-functions
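Given a policy $\pi$ and its Q-function $Q^\pi(s, a)$, we can improve the policy: set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$. This new policy is at least as good as $\pi$ (and probably better), and it doesn't matter what $\pi$ is.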
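The greedy improvement step can be sketched for a small tabular problem. This is only an illustration: the function name `greedy_policy` and the Q-table values are made up for the example, and the Q-values are assumed to be already known for the current policy.

```python
import numpy as np

def greedy_policy(q_values):
    """Build a deterministic policy that puts probability 1 on the
    action with the highest Q-value in each state.

    q_values: array of shape (num_states, num_actions), assumed to hold
    the Q-function of the current policy (hypothetical numbers below).
    """
    q_values = np.asarray(q_values, dtype=float)
    best_actions = np.argmax(q_values, axis=1)
    policy = np.zeros_like(q_values)
    # One-hot row per state: probability 1 on the argmax action.
    policy[np.arange(q_values.shape[0]), best_actions] = 1.0
    return policy

# Illustrative Q-table: 2 states, 3 actions.
q_table = np.array([[1.0, 2.0, 0.5],
                    [0.0, -1.0, 3.0]])
new_policy = greedy_policy(q_table)
```

Note that the construction never looks at the old policy's action probabilities, only at its Q-values, which is why it does not matter what the old policy was.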
Compute gradient to increase probability of good actions
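If $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average; recall that $V^\pi(s) = E[Q^\pi(s, a)]$ under $\pi(a \mid s)$.
So we can modify $\pi(a \mid s)$ to increase the probability of $a$ whenever $Q^\pi(s, a) > V^\pi(s)$, for example by computing a gradient and taking a step along it.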
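As a rough sketch of this idea, consider a single state with a softmax policy over three actions. The parameterization (one logit per action), the Q-values, and the step size are all assumptions chosen for illustration; the Q-values are treated as fixed while the policy is updated.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def policy_gradient_step(logits, q_values, step_size=1.0):
    """One gradient-ascent step on the expected Q-value under a softmax policy.

    For a softmax policy, the gradient of sum_a pi(a) * Q(a) with respect
    to the logit of action a is pi(a) * (Q(a) - V), where V = sum_a pi(a) * Q(a).
    """
    probs = softmax(logits)
    value = np.dot(probs, q_values)      # V: average Q under the policy
    advantage = q_values - value         # positive means better than average
    grad = probs * advantage
    return logits + step_size * grad

# Hypothetical example: uniform initial policy, made-up Q-values.
logits = np.zeros(3)
q_values = np.array([1.0, 2.0, 0.5])
old_probs = softmax(logits)
new_probs = softmax(policy_gradient_step(logits, q_values))
```

In this example only the second action has a Q-value above the average, so the step shifts probability toward it and raises the expected Q-value of the policy.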