Q iteration

Fitted Q-iteration

What if we don't know the transition dynamics? Fitted value iteration needs the dynamics to evaluate the effect of each candidate action, so instead we work with Q-values and fit a Q-function rather than V.

Fitted Q-iteration algorithm:

repeat until convergence:

====1: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a'_i)$

====2: set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2$

+: works even for off-policy samples (unlike actor-critic)

+: only one network, no high-variance policy gradient

-: no convergence guarantees for non-linear function approximation

Full fitted Q-iteration algorithm:

repeat until convergence:

====1: collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy; (dataset size N, collection policy)

====repeat K times; (iterations K)

========2: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a'_i)$

========3: set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\|Q_\phi(s_i,a_i)-y_i\|^2$ ; (gradient steps S)
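
Below is a minimal numpy sketch of the full fitted Q-iteration loop on a toy finite MDP. Everything here (the random MDP, `collect_dataset`, `featurize`, and the one-hot linear Q-function) is an illustrative assumption rather than part of the course material; it only serves to make steps 1-3 concrete.

```python
# Minimal fitted Q-iteration sketch on a toy finite MDP (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random toy MDP, used only to generate transitions for the dataset.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

def collect_dataset(n_samples):
    """Step 1: collect (s, a, s', r) tuples with a random (off-policy) behavior policy."""
    s = rng.integers(n_states, size=n_samples)
    a = rng.integers(n_actions, size=n_samples)
    s_next = np.array([rng.choice(n_states, p=P[si, ai]) for si, ai in zip(s, a)])
    return s, a, s_next, R[s, a]

def featurize(s, a):
    """One-hot (s, a) features; a linear Q_phi over them is just a table."""
    x = np.zeros((len(s), n_states * n_actions))
    x[np.arange(len(s)), s * n_actions + a] = 1.0
    return x

phi = np.zeros(n_states * n_actions)      # parameters of Q_phi
s, a, s_next, r = collect_dataset(5000)   # dataset size N
X = featurize(s, a)

for k in range(50):                       # iterations K
    # Step 2: y_i = r(s_i, a_i) + gamma * max_a' Q_phi(s'_i, a')
    q_next = phi.reshape(n_states, n_actions)[s_next]
    y = r + gamma * q_next.max(axis=1)
    # Step 3: phi <- argmin_phi 1/2 sum_i ||Q_phi(s_i, a_i) - y_i||^2 (closed-form least squares)
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)

print("greedy policy:", phi.reshape(n_states, n_actions).argmax(axis=1))
```

Note that the dataset is collected once by a random behavior policy and reused for every iteration: the loop is off-policy, which is the property discussed next.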

Why off-policy

Given $s$ and $a$, the transition is independent of $\pi$. The targets $y_i$ involve only $r(s_i,a_i)$ and $\max_{a'}Q_\phi(s'_i,a'_i)$, neither of which depends on the policy that collected the samples, so data gathered by any behavior policy can be reused.

Optimize what

In step 2 of the full algorithm, the max trick improves the policy (exactly so in the tabular case). If we denote the dataset distribution as $\beta$, the error being minimized is the Bellman error:

$$\mathcal{E}=\frac{1}{2}\mathbb{E}_{(s,a)\sim\beta}\left[\left(Q_\phi(s,a)-\left[r(s,a)+\gamma\max_{a'}Q_\phi(s',a')\right]\right)^2\right]$$

If $\mathcal{E}=0$, then $Q_\phi(s,a)=r(s,a)+\gamma\max_{a'}Q_\phi(s',a')$, which is the optimal Q-function, corresponding to the optimal policy $\pi'$. But once we leave the tabular case and use neural-network function approximation, most guarantees are lost.
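
As a concrete check, the expectation above can be estimated on the collected dataset. The helper below is a hypothetical sketch, assuming a tabular `q` array standing in for $Q_\phi$ and the dataset arrays from step 1; it just averages the squared Bellman residual under the empirical distribution $\beta$.

```python
# Empirical estimate of the Bellman error E under the dataset distribution beta.
# q: (n_states, n_actions) array standing in for Q_phi; s, a, s_next, r: dataset arrays.
import numpy as np

def bellman_error(q, s, a, s_next, r, gamma):
    residual = q[s, a] - (r + gamma * q[s_next].max(axis=1))
    return 0.5 * np.mean(residual ** 2)
```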

Online Q-iteration

Online Q-iteration algorithm: set $N=1,K=1$, i.e. take one transition and one gradient step per loop.

repeat until convergence:

====1: collect one sample $(s_i,a_i,s'_i,r_i)$ using some policy; (collection policy)

====2: set $y_i\leftarrow r(s_i,a_i)+\gamma \max_{a'}Q_\phi(s'_i,a'_i)$

====3: set $\phi\leftarrow \phi-\alpha\frac{dQ_\phi}{d\phi}(s_i,a_i)\left(Q_\phi(s_i,a_i)-y_i\right)$ ; (gradient steps S)
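
A minimal self-contained sketch of the online update on a toy MDP like the one above. The MDP and the epsilon-greedy behavior policy are illustrative assumptions; with a tabular Q, $\frac{dQ_\phi}{d\phi}$ is a one-hot indicator, so the gradient step updates a single entry.

```python
# Online Q-iteration on a toy finite MDP (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, epsilon = 5, 3, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

Q = np.zeros((n_states, n_actions))
s = rng.integers(n_states)
for t in range(20000):
    # step 1: take one action with an epsilon-greedy behavior policy, observe (s, a, s', r)
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # step 2: y = r(s, a) + gamma * max_a' Q(s', a')
    y = r + gamma * Q[s_next].max()
    # step 3: phi <- phi - alpha * dQ_phi/dphi(s, a) * (Q(s, a) - y)
    Q[s, a] -= alpha * (Q[s, a] - y)
    s = s_next

print("greedy policy from online Q-iteration:", Q.argmax(axis=1))
```

Keeping epsilon above zero makes the behavior policy stochastic, which is exactly the exploration issue discussed next.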

Exploration

Collecting samples in step 1 with only the final greedy policy, which is deterministic, is a bad idea because of the exploration-exploitation problem: we never try the actions the current Q-function happens to undervalue. Since Q-iteration is an off-policy algorithm, other collection policies can be used instead, such as:

epsilon-greedy:

$$\pi(a_t|s_t)= \begin{cases} 1-\epsilon &\text{if }a_t=\arg\max_{a_t}Q_\phi(s_t,a_t)\\ \epsilon/(|\mathcal{A}|-1)&\text{otherwise} \end{cases}$$

Boltzmann exploration:

$$\pi(a_t|s_t)\propto \exp(Q_\phi(s_t,a_t))$$
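
A small sketch of both exploration rules as action-sampling functions. The `temperature` knob in the Boltzmann version is an added assumption, not something specified above; `q_s` stands for the row $Q_\phi(s_t,\cdot)$ of a discrete Q-function.

```python
# Exploration policies for a discrete action space, given q_s = Q_phi(s, .).
import numpy as np

def epsilon_greedy(q_s, epsilon, rng):
    """With prob. 1 - epsilon take argmax_a Q(s, a); otherwise one of the other |A| - 1 actions."""
    greedy = int(np.argmax(q_s))
    if rng.random() < 1 - epsilon:
        return greedy
    others = [a for a in range(len(q_s)) if a != greedy]
    return int(rng.choice(others))

def boltzmann(q_s, rng, temperature=1.0):
    """Sample a ~ softmax(Q(s, .) / temperature); higher Q-values are chosen more often."""
    logits = q_s / temperature
    logits = logits - logits.max()            # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_s), p=probs))

rng = np.random.default_rng(0)
q_s = np.array([1.0, 2.0, 0.5])
print(epsilon_greedy(q_s, epsilon=0.1, rng=rng), boltzmann(q_s, rng))
```

Both policies keep every action's probability non-zero, which is what lets the off-policy Q-iteration updates keep seeing new $(s,a)$ pairs.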
