Value functions and Q-functions
Definition: Q-function
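$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$: the total expected reward from taking $a_t$ in $s_t$ and following $\pi$ afterwards.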
Definition: Value function
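$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t]$: the total expected reward from $s_t$ when following $\pi$.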
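$V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}[Q^\pi(s_t, a_t)]$, so the value function can be rewritten in terms of the Q-function. (This is the relation between the Q-function and the value function.)
In addition, the RL objective can be rewritten with the value function: $E_{s_1 \sim p(s_1)}[V^\pi(s_1)]$.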
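To make the definitions concrete, here is a minimal sketch that computes the Q-function and value function of a fixed policy by backward recursion on a tiny finite-horizon MDP. The MDP itself (`P`, `R`, `PI`, `P_S1`, `HORIZON`) is entirely hypothetical and the numbers are made up; time steps are 0-indexed in the code, so `V[0]` plays the role of $V^\pi(s_1)$.

```python
# Hypothetical 2-state, 2-action MDP with a finite horizon; all numbers are made up.
STATES = [0, 1]
ACTIONS = [0, 1]
HORIZON = 3

# P[s][a] lists next-state probabilities p(s' | s, a); R[s][a] is the reward r(s, a).
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.0, 1.0]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.5}}

PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}  # fixed stochastic policy, PI[s][a] = pi(a | s)
P_S1 = [0.6, 0.4]                    # initial state distribution p(s_1)


def evaluate(pi):
    """Return Q[t][s][a] and V[t][s] for t = 0..HORIZON-1 by backward recursion."""
    v_next = [0.0 for _ in STATES]  # no reward is collected after the final step
    q_all, v_all = [], []
    for _ in range(HORIZON):
        # Q(s, a) = r(s, a) + E_{s'}[V(s')]
        q_t = [[R[s][a] + sum(P[s][a][s2] * v_next[s2] for s2 in STATES)
                for a in ACTIONS] for s in STATES]
        # V(s) = E_{a ~ pi(a | s)}[Q(s, a)]
        v_t = [sum(pi[s][a] * q_t[s][a] for a in ACTIONS) for s in STATES]
        q_all.insert(0, q_t)
        v_all.insert(0, v_t)
        v_next = v_t
    return q_all, v_all


Q, V = evaluate(PI)
# RL objective: E_{s_1 ~ p(s_1)}[V(s_1)]
objective = sum(P_S1[s] * V[0][s] for s in STATES)
print("V at the first step:", V[0])
print("objective:", objective)
```

Because the horizon is finite, the tables depend on the time step, which is why `Q` and `V` are indexed by `t` first.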
Using Q-functions and value functions
Idea 1
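If we have a policy $\pi$ and we know $Q^\pi(s, a)$, then we can improve $\pi$: set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$. This new policy is at least as good as $\pi$ (and probably better), and it doesn't matter what $\pi$ is.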
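The greedy construction can be sketched as follows. The Q-table and the current policy are hypothetical, made-up values, and the helper names `greedy_policy` and `expected_q` are introduced only for illustration. The final check shows the one-step inequality behind the improvement claim: in every state, the greedy action's Q-value is at least the average Q-value under the old policy.

```python
# Hypothetical Q^pi(s, a) table for 3 states and 2 actions (made-up values).
Q = [[1.0, 2.5],
     [0.3, 0.1],
     [4.0, 4.0]]
# The current policy pi(a | s) that Q is assumed to belong to.
pi = [[0.5, 0.5],
      [0.2, 0.8],
      [0.9, 0.1]]


def greedy_policy(q_table):
    """pi'(a | s) = 1 if a = argmax_a Q(s, a), else 0 (ties go to the lowest index)."""
    new_pi = []
    for q_row in q_table:
        best = max(range(len(q_row)), key=lambda a: q_row[a])
        new_pi.append([1.0 if a == best else 0.0 for a in range(len(q_row))])
    return new_pi


def expected_q(policy, q_table):
    """E_{a ~ policy(. | s)}[Q(s, a)] for every state s."""
    return [sum(p * q for p, q in zip(p_row, q_row))
            for p_row, q_row in zip(policy, q_table)]


pi_new = greedy_policy(Q)
v_old = expected_q(pi, Q)         # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
v_greedy = expected_q(pi_new, Q)  # max_a Q^pi(s, a)

for s, (old, new) in enumerate(zip(v_old, v_greedy)):
    print(f"state {s}: V^pi = {old:.3f}, greedy one-step value = {new:.3f}")
```

Note that the greedy rule never looks at `pi` itself, only at `Q`, which matches the remark that it doesn't matter what $\pi$ is.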
Idea 2
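Compute a gradient to increase the probability of good actions $a$.
If $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average; recall that $V^\pi(s) = E[Q^\pi(s, a)]$ under $\pi(a \mid s)$.
So we can modify $\pi(a \mid s)$ to increase the probability of $a$ if $Q^\pi(s, a) > V^\pi(s)$, for example by taking a gradient step.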
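As a rough illustration of this idea, the sketch below takes one gradient ascent step on a softmax policy for a single state. Everything here is an assumption made for the example: the softmax parameterization, the made-up Q-values, the step size, and the simplification that $Q^\pi$ is held fixed during the step. For a softmax policy, the gradient of $E_{a \sim \pi}[Q(s, a)]$ with respect to the logit of action $a$ is $\pi(a \mid s)\,(Q(s, a) - V(s))$, so logits of better-than-average actions are pushed up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def advantage_gradient_step(theta, q_values, lr):
    """One ascent step on E_{a ~ pi}[Q(a)] for pi = softmax(theta), with Q held fixed."""
    pi = softmax(theta)
    v = sum(p * q for p, q in zip(pi, q_values))  # V = E_{a ~ pi}[Q(a)]
    adv = [q - v for q in q_values]               # Q(a) - V, positive if a is above average
    grad = [p * a for p, a in zip(pi, adv)]       # d/d theta_a of E_{a ~ pi}[Q(a)]
    return [t + lr * g for t, g in zip(theta, grad)]


q_values = [1.0, 2.0, 4.0]  # hypothetical Q^pi(s, a) for one state and three actions
theta = [0.0, 0.0, 0.0]     # uniform initial policy
old_pi = softmax(theta)
new_pi = softmax(advantage_gradient_step(theta, q_values, lr=0.5))

old_value = sum(p * q for p, q in zip(old_pi, q_values))
new_value = sum(p * q for p, q in zip(new_pi, q_values))
print("old policy:", old_pi)
print("new policy:", new_pi)
print("expected Q before and after:", old_value, new_value)
```

In this example only the last action has $Q > V$, and its probability rises after the step while the expected Q-value under the policy increases.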