Value functions and Q-functions

Definition: Q-function

$$Q^\pi(s_t,a_t)=\sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'},a_{t'}) \mid s_t,a_t\big]$$

Total reward from taking $a_t$ in $s_t$.

Definition: Value function

Total reward from $s_t$.

$$V^\pi(s_t)=\sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'},a_{t'}) \mid s_t\big]$$
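As a concrete reading of these two definitions, here is a minimal Monte Carlo sketch. It assumes a hypothetical simulator interface (`env.reset_to(s)` to set the state, a gym-style `env.step(a)` returning `(state, reward, done, info)`) and a callable `policy(s)` that samples $a_t \sim \pi_\theta(a_t \mid s_t)$; both estimates simply average total rewards over sampled trajectories.

```python
import numpy as np

def rollout_return(env, policy, s, a, horizon):
    """Total reward of one rollout that takes action a in state s, then follows pi."""
    env.reset_to(s)                      # hypothetical: put the simulator in state s
    total, action = 0.0, a
    for _ in range(horizon):
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
        action = policy(state)           # sample a_t ~ pi_theta(a_t | s_t)
    return total

def mc_q(env, policy, s, a, horizon, n=100):
    """Q^pi(s, a): expected total reward from taking a in s, then following pi."""
    return np.mean([rollout_return(env, policy, s, a, horizon) for _ in range(n)])

def mc_v(env, policy, s, horizon, n=100):
    """V^pi(s): expected total reward from s, with the first action also drawn from pi."""
    return np.mean([rollout_return(env, policy, s, policy(s), horizon) for _ in range(n)])
```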

The value function can also be written in terms of the Q-function (this is the relation between the Q-function and the value function):

$$V^\pi(s_t)=E_{a_t\sim \pi_\theta(a_t\mid s_t)}\big[Q^\pi(s_t,a_t)\big]$$

Likewise, the RL objective can be rewritten with the value function:

$$J(\theta)=E_{s_1\sim p(s_1)}\big[V^\pi(s_1)\big]$$
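In a tabular setting, both identities reduce to simple array operations. A minimal sketch with a made-up 3-state, 2-action MDP (the values of $Q^\pi$, $\pi$, and $p(s_1)$ are assumptions for illustration only):

```python
import numpy as np

Q = np.array([[1.0, 2.0],       # Q^pi(s, a), assumed values
              [0.5, 0.0],
              [3.0, 1.0]])
pi = np.array([[0.7, 0.3],      # pi(a | s), each row sums to 1
               [0.4, 0.6],
               [0.9, 0.1]])
p1 = np.array([0.5, 0.3, 0.2])  # initial state distribution p(s_1)

V = (pi * Q).sum(axis=1)        # V^pi(s) = E_{a ~ pi(a|s)}[Q^pi(s, a)]
J = p1 @ V                      # J(theta) = E_{s_1 ~ p(s_1)}[V^pi(s_1)]
print(V, J)                     # V ≈ [1.3, 0.2, 2.8], J ≈ 1.27
```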

Using Q-functions and value functions

Idea 1

$$Q^\pi(s_t,a_t) \;\Rightarrow\; \text{improve policy } \pi$$

Set $\pi'(a\mid s)=1$ if $a=\arg\max_a Q^\pi(s,a)$ (and $0$ otherwise). This new policy $\pi'$ is at least as good as $\pi$ (and probably better), no matter what $\pi$ is.
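A minimal sketch of Idea 1 for a tabular $Q^\pi$: the improved policy simply puts all probability on the argmax action in each state (the Q-table is the same made-up example as above).

```python
import numpy as np

def greedy_improvement(Q):
    """Return pi'(a|s) = 1 for a = argmax_a Q^pi(s, a), 0 otherwise."""
    pi_new = np.zeros_like(Q)
    pi_new[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi_new

Q = np.array([[1.0, 2.0],
              [0.5, 0.0],
              [3.0, 1.0]])
print(greedy_improvement(Q))
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]]
```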

Idea 2

Compute a gradient to increase the probability of good actions $a$:

If $Q^\pi(s,a) > V^\pi(s)$, then $a$ is better than average; recall that $V^\pi(s)=E[Q^\pi(s,a)]$ under $a\sim\pi(a\mid s)$.

So we can modify $\pi(a\mid s)$ to increase the probability of $a$, e.g., by taking a gradient step.
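A minimal sketch of Idea 2 for a tabular softmax policy, assuming $Q^\pi$ is already known: actions with positive advantage $Q^\pi(s,a)-V^\pi(s)$ have their probability pushed up by a gradient step on the policy logits.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros((3, 2))                 # logits of a tabular softmax policy pi(a|s)
Q = np.array([[1.0, 2.0],                # assumed Q^pi, same made-up table as above
              [0.5, 0.0],
              [3.0, 1.0]])
lr = 0.1

for s in range(theta.shape[0]):
    pi_s = softmax(theta[s])
    V_s = pi_s @ Q[s]                    # V^pi(s) = E_{a ~ pi(a|s)}[Q^pi(s, a)]
    A_s = Q[s] - V_s                     # advantage: > 0 means better than average
    # Expected policy gradient for this state: E_{a~pi}[grad log pi(a|s) * A(s, a)].
    # For softmax logits, grad log pi(a|s) = onehot(a) - pi_s, which gives:
    grad = pi_s * A_s - pi_s * (pi_s @ A_s)   # second term is 0 since E_pi[A] = 0
    theta[s] += lr * grad                # better-than-average actions gain probability
```

This per-state advantage weighting is the idea behind the actor-critic algorithms covered later in the course.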
