Value functions and Q-functions

Definition: Q-function

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$$

Total reward from taking $a_t$ in $s_t$

Definition: Value function

Total reward from $s_t$

$$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t \right]$$
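
To make these definitions concrete, here is a minimal Monte Carlo sketch that estimates $Q^\pi(s_t, a_t)$ and $V^\pi(s_t)$ by averaging summed rewards over sampled rollouts. The `env` object (with `set_state`/`step` methods) and the `policy` function are hypothetical placeholders assumed only for illustration.

```python
import numpy as np

def rollout_return(env, policy, s_t, a_t=None, T=100):
    """Sample one trajectory starting from state s_t and sum its rewards.

    If a_t is given, it is forced as the first action (for estimating Q);
    otherwise every action is sampled from the policy (for estimating V).
    Assumes a hypothetical interface: env.set_state(s) resets the simulator
    to state s, and env.step(a) returns (next_state, reward, done).
    """
    env.set_state(s_t)
    s, total = s_t, 0.0
    for t in range(T):
        a = a_t if (t == 0 and a_t is not None) else policy(s)
        s, r, done = env.step(a)
        total += r
        if done:
            break
    return total

def mc_value(env, policy, s_t, n=1000):
    """Monte Carlo estimate of V^pi(s_t): average return over n rollouts."""
    return np.mean([rollout_return(env, policy, s_t) for _ in range(n)])

def mc_q_value(env, policy, s_t, a_t, n=1000):
    """Monte Carlo estimate of Q^pi(s_t, a_t): force a_t first, then follow pi."""
    return np.mean([rollout_return(env, policy, s_t, a_t) for _ in range(n)])
```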

The value function can be rewritten in terms of the Q-function (the relation between the Q-function and the value function):

$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]$$

The RL objective can also be rewritten in terms of the value function:

$$J(\theta) = E_{s_1 \sim p(s_1)}\left[ V^\pi(s_1) \right]$$
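
In the tabular (discrete) case, both expectations reduce to weighted sums. A minimal sketch, with made-up numbers for $Q^\pi$, $\pi(a \mid s)$, and $p(s_1)$:

```python
import numpy as np

# Toy tabular setup (all numbers are made up for illustration):
# 3 states, 2 actions.
Q = np.array([[1.0, 3.0],     # Q^pi(s, a) for each state-action pair
              [0.5, 0.5],
              [2.0, 0.0]])
pi = np.array([[0.2, 0.8],    # pi(a | s), each row sums to 1
               [0.5, 0.5],
               [0.9, 0.1]])
p_s1 = np.array([0.6, 0.3, 0.1])  # initial state distribution p(s_1)

# V^pi(s) = E_{a ~ pi(a|s)}[ Q^pi(s, a) ]  -> expectation over actions
V = (pi * Q).sum(axis=1)

# J(theta) = E_{s_1 ~ p(s_1)}[ V^pi(s_1) ]  -> expectation over initial states
J = (p_s1 * V).sum()

print(V)  # [2.6, 0.5, 1.8]
print(J)  # 0.6*2.6 + 0.3*0.5 + 0.1*1.8 = 1.89
```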

Using Q-functions and value functions

Idea 1

$$Q^\pi(s_t, a_t) \Rightarrow \text{improve policy } \pi$$

Set $\pi'(a|s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$ (and $0$ otherwise). This new policy $\pi'$ is at least as good as $\pi$ (and probably better), and this holds regardless of what $\pi$ is.
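
A minimal sketch of this greedy improvement step, assuming $Q^\pi$ is available as a tabular `[num_states, num_actions]` array (the values are made up for illustration):

```python
import numpy as np

def greedy_improvement(Q):
    """Given tabular Q^pi with shape [num_states, num_actions], return the
    deterministic improved policy pi'(a|s): probability 1 on the action
    argmax_a Q^pi(s, a) and 0 on every other action."""
    num_states, num_actions = Q.shape
    pi_new = np.zeros((num_states, num_actions))
    pi_new[np.arange(num_states), Q.argmax(axis=1)] = 1.0
    return pi_new

Q = np.array([[1.0, 3.0],
              [0.5, 0.5],
              [2.0, 0.0]])
print(greedy_improvement(Q))
# [[0. 1.]
#  [1. 0.]   <- ties are broken toward the first action by argmax
#  [1. 0.]]
```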

Idea 2

Compute a gradient to increase the probability of good actions $a$:

If $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average (recall that $V^\pi(s) = E[Q^\pi(s, a)]$ under $\pi(a|s)$).

So we can modify $\pi(a|s)$ to increase the probability of $a$, e.g., by computing a gradient.
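
A minimal sketch of this idea for a single state, assuming a softmax policy over discrete actions and made-up $Q^\pi$ values; the logit update raises the probability of actions whose advantage $Q^\pi(s,a) - V^\pi(s)$ is positive:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical single-state example with a softmax policy over 3 actions.
logits = np.zeros(3)
pi = softmax(logits)            # pi(a|s); uniform to start

Q = np.array([1.0, 2.0, 0.5])   # made-up Q^pi(s, a) values
V = (pi * Q).sum()              # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V                       # advantage: positive => better than average

# Policy-gradient step on the logits.  For a softmax policy,
# grad_logits log pi(a|s) = onehot(a) - pi, and averaging
# (onehot(a) - pi) * A[a] over a ~ pi(a|s) gives exactly pi * A,
# because E_{a~pi}[A(s, a)] = 0 cancels the -pi term.
learning_rate = 1.0
logits = logits + learning_rate * pi * A
new_pi = softmax(logits)

print(pi)       # [0.333 0.333 0.333]
print(new_pi)   # probability of the high-advantage action (a = 1) increases
```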
