Definition: Q-function
$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$: the total expected reward from taking $a_t$ in $s_t$ and following $\pi$ thereafter.
Definition: Value function
The total expected reward from $s_t$:

$V^\pi(s_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t]$

This can be rewritten with the Q-function, which gives the relation between the Q-function and the value function:

$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}[Q^\pi(s_t, a_t)]$

Besides, the RL objective can be rewritten with the value function:
$J(\theta) = \mathbb{E}_{s_1 \sim p(s_1)}[V^\pi(s_1)]$

Using Q-functions and value functions
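The identities above can be checked numerically. A minimal sketch with a hypothetical tabular setup (all Q-values, policy probabilities, and the initial-state distribution are illustrative, not from the notes):

```python
import numpy as np

# Hypothetical tabular example: 2 states, 2 actions.
# Q[s, a] stands in for Q^pi(s, a); pi[s, a] is the policy pi(a | s).
Q = np.array([[1.0, 3.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.8, 0.2]])

# V^pi(s) = E_{a ~ pi(a|s)}[Q^pi(s, a)]
V = (pi * Q).sum(axis=1)      # -> [2.0, 0.4]

# J(theta) = E_{s1 ~ p(s1)}[V^pi(s1)], with an assumed initial-state distribution
p_s1 = np.array([0.5, 0.5])
J = (p_s1 * V).sum()          # -> 1.2
```

The value function is just the policy-weighted average of Q-values per state, and the objective is the average value of the initial state.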
Idea 1
If we know $Q^\pi(s_t, a_t)$, we can improve the policy $\pi$: set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$ (and $0$ otherwise). This new policy $\pi'$ is at least as good as $\pi$ (and usually better), and it doesn't matter what $\pi$ is.
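The greedy improvement step can be sketched as follows, with hypothetical Q-values (the numbers are illustrative):

```python
import numpy as np

# Hypothetical Q^pi(s, a) table for 3 states, 2 actions.
Q = np.array([[1.0, 2.0],
              [5.0, 0.0],
              [3.0, 3.5]])

# Idea 1: pi'(a|s) = 1 if a = argmax_a Q^pi(s, a), else 0.
greedy_actions = Q.argmax(axis=1)                 # -> [1, 0, 1]
pi_new = np.zeros_like(Q)
pi_new[np.arange(len(Q)), greedy_actions] = 1.0   # deterministic policy
```

Each row of `pi_new` is a one-hot distribution that puts all probability on the action with the highest Q-value in that state.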
Idea 2
Compute a gradient that increases the probability of good actions $a$: if $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average (recall that $V^\pi(s) = \mathbb{E}[Q^\pi(s, a)]$ under $\pi(a \mid s)$). So we can modify $\pi(a \mid s)$ to increase the probability of $a$, e.g. by taking a gradient step.
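A minimal sketch of this idea for a single state, assuming a softmax policy over logits `theta` and hypothetical Q-values (the parameterization and numbers are illustrative, not from the notes):

```python
import numpy as np

theta = np.zeros(3)               # assumed softmax-policy logits for one state
Q = np.array([1.0, 4.0, 2.0])     # hypothetical Q^pi(s, a) values

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = policy(theta)
V = (p * Q).sum()                 # V^pi(s) = E[Q^pi(s, a)] under pi(a|s)
A = Q - V                         # advantage: positive means better than average

# Expected policy gradient: for a softmax, d log pi(a)/d theta = onehot(a) - pi,
# so we accumulate pi(a) * A(a) * (onehot(a) - pi) over actions.
grad = np.zeros(3)
for a in range(3):
    grad += p[a] * A[a] * (np.eye(3)[a] - p)

theta_new = theta + 0.5 * grad    # one gradient-ascent step
```

After the step, the probability of the better-than-average action (index 1 here) increases, while below-average actions lose probability.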