Definition: Q-function
$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$: the total expected reward from taking $a_t$ in $s_t$ and following $\pi$ thereafter.
Definition: Value function
The total expected reward from $s_t$:

$V^\pi(s_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t]$

This can be rewritten with the Q-function, which gives the relation between the Q-function and the value function:

$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}[Q^\pi(s_t, a_t)]$

Besides, the RL objective can be rewritten with the value function:
$J(\theta) = \mathbb{E}_{s_1 \sim p(s_1)}[V^\pi(s_1)]$

Using Q-functions and value functions
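The identities above can be checked numerically. A minimal sketch with a hypothetical tabular setup (all Q-values, policy probabilities, and the initial-state distribution are illustrative, not from the notes):

```python
import numpy as np

# Hypothetical tabular example: 2 states, 2 actions.
# Q[s, a] stands in for Q^pi(s, a); pi[s, a] is the policy pi(a | s).
Q = np.array([[1.0, 3.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.8, 0.2]])

# V^pi(s) = E_{a ~ pi(a|s)}[Q^pi(s, a)]
V = (pi * Q).sum(axis=1)      # -> [2.0, 0.4]

# J(theta) = E_{s1 ~ p(s1)}[V^pi(s1)], with an assumed initial-state distribution
p_s1 = np.array([0.5, 0.5])
J = (p_s1 * V).sum()          # -> 1.2
```

The value function is just the policy-weighted average of Q-values per state, and the objective is the average value of the initial state.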
Idea 1
If we know $Q^\pi(s_t, a_t)$, we can improve the policy $\pi$: set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$ (and $0$ otherwise). This new policy $\pi'$ is at least as good as $\pi$ (and usually better), and it doesn't matter what $\pi$ is.
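The greedy improvement step can be sketched as follows, with hypothetical Q-values (the numbers are illustrative):

```python
import numpy as np

# Hypothetical Q^pi(s, a) table for 3 states, 2 actions.
Q = np.array([[1.0, 2.0],
              [5.0, 0.0],
              [3.0, 3.5]])

# Idea 1: pi'(a|s) = 1 if a = argmax_a Q^pi(s, a), else 0.
greedy_actions = Q.argmax(axis=1)                 # -> [1, 0, 1]
pi_new = np.zeros_like(Q)
pi_new[np.arange(len(Q)), greedy_actions] = 1.0   # deterministic policy
```

Each row of `pi_new` is a one-hot distribution that puts all probability on the action with the highest Q-value in that state.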
Idea 2
Compute a gradient that increases the probability of good actions $a$: if $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average (recall that $V^\pi(s) = \mathbb{E}[Q^\pi(s, a)]$ under $\pi(a \mid s)$). So we can modify $\pi(a \mid s)$ to increase the probability of $a$, e.g. by taking a gradient step.
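A minimal sketch of this idea for a single state, assuming a softmax policy over logits `theta` and hypothetical Q-values (the parameterization and numbers are illustrative, not from the notes):

```python
import numpy as np

theta = np.zeros(3)               # assumed softmax-policy logits for one state
Q = np.array([1.0, 4.0, 2.0])     # hypothetical Q^pi(s, a) values

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = policy(theta)
V = (p * Q).sum()                 # V^pi(s) = E[Q^pi(s, a)] under pi(a|s)
A = Q - V                         # advantage: positive means better than average

# Expected policy gradient: for a softmax, d log pi(a)/d theta = onehot(a) - pi,
# so we accumulate pi(a) * A(a) * (onehot(a) - pi) over actions.
grad = np.zeros(3)
for a in range(3):
    grad += p[a] * A[a] * (np.eye(3)[a] - p)

theta_new = theta + 0.5 * grad    # one gradient-ascent step
```

After the step, the probability of the better-than-average action (index 1 here) increases, while below-average actions lose probability.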