Value functions and Q-functions

Definition: Q-function

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big]$$

The Q-function is the expected total reward from taking action $a_t$ in state $s_t$ and then following $\pi$.
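As a rough illustration (not from the course), $Q^\pi$ can be estimated by Monte Carlo: start in $s_t$, take $a_t$, then roll out the policy and average the summed rewards. The `env.reset_to`, `env.step`, and `policy.sample` interfaces below are hypothetical placeholders.

```python
import numpy as np

def estimate_q(env, policy, s_t, a_t, horizon, n_rollouts=100):
    """Monte Carlo sketch of Q^pi(s_t, a_t): average the total reward of
    rollouts that take a_t in s_t and then follow the policy."""
    returns = []
    for _ in range(n_rollouts):
        env.reset_to(s_t)                # hypothetical: put the env in state s_t
        s, r, done = env.step(a_t)       # fixed first action a_t
        total = r
        for _ in range(horizon - 1):     # finite horizon T, undiscounted as in the definition
            if done:
                break
            a = policy.sample(s)         # a ~ pi_theta(a | s)
            s, r, done = env.step(a)
            total += r
        returns.append(total)
    return np.mean(returns)              # approximates E_pi[sum_{t'=t}^T r(s_t', a_t') | s_t, a_t]
```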

Definition: Value function

The value function is the expected total reward from state $s_t$ onward when following $\pi$:

$$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t\big]$$

It can be rewritten in terms of the Q-function (this is the relation between the Q-function and the value function):

$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[Q^\pi(s_t, a_t)\big]$$
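A minimal numerical check of this relation, assuming a single state with three discrete actions and made-up numbers:

```python
import numpy as np

# Hypothetical single state s with 3 actions; all numbers are made up.
q_s = np.array([1.0, 4.0, 2.0])    # Q^pi(s, a) for each action a
pi_s = np.array([0.2, 0.5, 0.3])   # pi_theta(a | s), sums to 1

v_s = np.sum(pi_s * q_s)           # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
print(v_s)                         # 0.2*1 + 0.5*4 + 0.3*2 = 2.8
```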

Besides, the RL objective can be rewritten with the value function:

$$J(\theta) = E_{s_1 \sim p(s_1)}\big[V^\pi(s_1)\big]$$

Using Q-functions and Value functions

Idea 1

If we have $Q^\pi(s_t, a_t)$, we can use it to improve the policy $\pi$.

Set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$. This new policy $\pi'$ is at least as good as $\pi$ (and probably better), and it does not matter what $\pi$ is.
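A minimal tabular sketch of this greedy improvement step (the Q-table here is random and only for illustration):

```python
import numpy as np

n_states, n_actions = 4, 3
Q = np.random.rand(n_states, n_actions)          # stand-in for an estimated Q^pi(s, a)

# pi'(a|s) = 1 for the argmax action in each state, 0 otherwise.
greedy_a = Q.argmax(axis=1)
pi_new = np.zeros((n_states, n_actions))
pi_new[np.arange(n_states), greedy_a] = 1.0

# Under pi', E_{a ~ pi'}[Q^pi(s, a)] = max_a Q^pi(s, a) >= E_{a ~ pi}[Q^pi(s, a)] = V^pi(s),
# which is why the new policy is at least as good as pi.
```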

Idea 2

Compute a gradient to increase the probability of good actions $a$.

If $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average (recall that $V^\pi(s) = E[Q^\pi(s, a)]$ under $\pi(a \mid s)$).

So we can modify $\pi(a \mid s)$ to increase the probability of $a$, for example by taking a gradient step.
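A minimal sketch of this idea, assuming a softmax policy over three actions in a single state and made-up estimates of $Q^\pi(s, a)$ and $V^\pi(s)$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(3)                        # logits of a softmax policy pi_theta(a | s)
probs = softmax(theta)

a = 1                                      # an action we sampled
q_sa, v_s = 4.0, 2.8                       # made-up estimates of Q^pi(s, a) and V^pi(s)
advantage = q_sa - v_s                     # > 0 means a is better than average

# For a softmax policy, grad_theta log pi(a|s) = onehot(a) - probs.
grad_log_pi = np.eye(3)[a] - probs
theta = theta + 0.1 * advantage * grad_log_pi   # step in the direction that raises pi(a|s)

print(softmax(theta)[a] > probs[a])        # True: the better-than-average action got more probable
```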