Learning theory

Tabular value function

Does tabular value iteration converge? And if so, to what? The value iteration update is

$$V(s) \leftarrow \max_a \big[ r(s,a) + \gamma \mathbb{E}[V(s')] \big]$$

We use the following notation:

$r_a$: the stacked vector of rewards at all states for action $a$

$\mathcal{T}_a$: the matrix of transitions for action $a$, with $\mathcal{T}_{a,i,j} = p(s'=i \mid s=j, a)$

Then define the backup operator $\mathcal{B}$ by $\mathcal{B}V = \max_a (r_a + \gamma \mathcal{T}_a V)$.

If $V^\star$ is a fixed point of $\mathcal{B}$, i.e. $\mathcal{B}V^\star = V^\star$, then $V^\star(s) = \max_a \big[ r(s,a) + \gamma \mathbb{E}[V^\star(s')] \big]$, which is exactly the Bellman optimality equation.

The good news is that in the tabular case $V^\star$ always exists, is always unique, and always corresponds to the optimal policy. We can prove that value iteration reaches $V^\star$ because $\mathcal{B}$ is a contraction: for any $V$ and $\bar{V}$ we have $\|\mathcal{B}V - \mathcal{B}\bar{V}\|_\infty \le \gamma \|V - \bar{V}\|_\infty$, i.e. each backup shrinks the gap between $V$ and $\bar{V}$ by at least a factor of $\gamma$ in the $\infty$-norm.

Choosing $\bar{V} = V^\star$ gives $\|\mathcal{B}V - V^\star\|_\infty \le \gamma \|V - V^\star\|_\infty$, so repeatedly applying $\mathcal{B}$ drives any $V$ to $V^\star$, which completes the proof.
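To make this concrete, here is a minimal NumPy sketch (the tiny random MDP, the seed, and all names are assumptions made up for illustration) that applies $\mathcal{B}$ to two different value functions and checks that their $\infty$-norm gap shrinks by at least a factor of $\gamma$ at every iteration:

```python
import numpy as np

# A toy sketch of tabular value iteration on a small random MDP, used only to
# illustrate that the backup operator B is a gamma-contraction in the infinity-norm.
# The MDP (3 states, 2 actions), the seed, and all names are assumptions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

r = rng.uniform(size=(n_actions, n_states))            # r[a] is the stacked reward vector r_a
T = rng.uniform(size=(n_actions, n_states, n_states))
T /= T.sum(axis=1, keepdims=True)                      # T[a, i, j] = p(s'=i | s=j, a); each column sums to 1

def bellman_backup(V):
    """(B V)(s) = max_a [ r(s, a) + gamma * E[V(s')] ]."""
    expected_next = np.einsum('aij,i->aj', T, V)       # E[V(s') | s=j, a] for every (a, j)
    return np.max(r + gamma * expected_next, axis=0)

V, V_bar = rng.normal(size=n_states), rng.normal(size=n_states)
for _ in range(50):
    gap_before = np.max(np.abs(V - V_bar))
    V, V_bar = bellman_backup(V), bellman_backup(V_bar)
    gap_after = np.max(np.abs(V - V_bar))
    assert gap_after <= gamma * gap_before + 1e-9      # contraction: the gap shrinks by at least gamma
print("both iterates approach the same fixed point V*:", V, V_bar)
```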

Non-tabular value function

In the tabular case the update is simply $V \leftarrow \mathcal{B}V$. In the non-tabular case the new value function has to live in our hypothesis class $\Omega$ (the set of value functions our function approximator can represent), so we need an additional operator $\Pi$: $\Pi V = \arg\min_{V' \in \Omega} \frac{1}{2} \sum_s \|V'(s) - V(s)\|^2$, i.e. $\Pi$ is the $\ell_2$ projection onto $\Omega$.

So the fitted value iteration algorithm is $V \leftarrow \Pi\mathcal{B}V$.
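As a sketch of what $V \leftarrow \Pi\mathcal{B}V$ looks like with an explicit projection, the snippet below (again a made-up random MDP; the linear features stand in for a neural network, and the least-squares fit plays the role of $\Pi$) alternates one exact backup with one $\ell_2$ regression onto the function class:

```python
import numpy as np

# A minimal sketch of fitted value iteration, V <- Pi B V, with a linear value
# function V_w(s) = phi(s)^T w. The random MDP and features are assumptions;
# in deep RL the least-squares fit is replaced by gradient steps on a network.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 2, 0.95

r = rng.uniform(size=(n_actions, n_states))
T = rng.uniform(size=(n_actions, n_states, n_states))
T /= T.sum(axis=1, keepdims=True)                      # T[a, i, j] = p(s'=i | s=j, a)
phi = rng.normal(size=(n_states, 2))                   # only 2 features, so Omega cannot represent every V

w = np.zeros(2)
for _ in range(100):
    V = phi @ w                                                          # current V_w in Omega
    target = np.max(r + gamma * np.einsum('aij,i->aj', T, V), axis=0)    # B V (exact backup)
    w, *_ = np.linalg.lstsq(phi, target, rcond=None)                     # Pi: l2 projection onto Omega
print("fitted values:", phi @ w)
```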

So value iteration converges in the tabular case, but fitted value iteration is not, in general, guaranteed to converge: as discussed below, the composed operator $\Pi\mathcal{B}$ is not a contraction.

Fitted Q-iteration

The same argument applies to fitted Q-iteration, which performs $Q \leftarrow \Pi\mathcal{B}Q$ with $(\mathcal{B}Q)(s,a) = r(s,a) + \gamma\mathbb{E}_{s'}[\max_{a'} Q(s',a')]$: once function approximation is involved, convergence is no longer guaranteed.
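For a flavor of what this update looks like in sample-based form, here is a hedged sketch with a linear Q-function (the fake batch of transitions, the features, and every name below are assumptions); with gradient steps on a neural network in place of the least-squares fit, this is essentially the update that the deep Q-learning methods in the next chapter build on:

```python
import numpy as np

# A minimal sketch of fitted Q-iteration with a linear Q-function
# Q_w(s, a) = phi(s, a)^T w, learned from sampled transitions (s, a, r, s').
# The fake batch of transitions and all names below are assumptions.
rng = np.random.default_rng(2)
n_states, n_actions, n_feat, gamma = 4, 2, 3, 0.9

phi = rng.normal(size=(n_states, n_actions, n_feat))   # feature vector for each (s, a) pair
w = np.zeros(n_feat)

# a fake batch of transitions; in practice these come from interacting with the environment
s   = rng.integers(n_states, size=64)
a   = rng.integers(n_actions, size=64)
rew = rng.uniform(size=64)
s2  = rng.integers(n_states, size=64)

for _ in range(50):
    q_next = phi[s2] @ w                               # Q_w(s', a') for every action a'
    y = rew + gamma * q_next.max(axis=1)               # bootstrapped target r + gamma * max_a' Q_w(s', a')
    w, *_ = np.linalg.lstsq(phi[s, a], y, rcond=None)  # projection step: regress Q_w(s, a) onto the targets
```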

Likewise, fitted bootstrapped policy evaluation is not guaranteed to converge either.

$\mathcal{B}$ is a contraction w.r.t. the $\infty$-norm (the "max" norm).

$\Pi$ is a contraction w.r.t. the $\ell_2$-norm (Euclidean distance).

However, the composition $\Pi\mathcal{B}$ is not a contraction with respect to any norm: because the two operators contract in different norms, the projection step can undo, and even amplify, the progress made by the backup.
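To see this failure concretely, here is a minimal sketch of a well-known two-state construction (all specifics below are assumptions chosen to reproduce it: zero rewards everywhere, state 1 goes to state 2, state 2 loops on itself, a single linear feature with values 1 and 2, and $\gamma=0.95$) in which the true $V^\star = 0$ is perfectly representable, yet each exact $\Pi\mathcal{B}$ step moves the weight farther away from it:

```python
import numpy as np

# Two states, zero reward, state 1 -> state 2 -> state 2, so the true V* is 0 everywhere.
# We approximate V_w(s) = w * phi(s) with phi = (1, 2) and project with least squares.
gamma = 0.95
phi = np.array([1.0, 2.0])

w = 1.0                                        # start anywhere other than the fixed point w = 0
for step in range(10):
    V = w * phi                                # current approximate values of the two states
    target = gamma * np.array([V[1], V[1]])    # B V: both states bootstrap from state 2, rewards are 0
    w = np.linalg.lstsq(phi[:, None], target, rcond=None)[0].item()  # Pi: l2 projection onto {w * phi}
    print(step, w)                             # |w| grows by a factor of 6*gamma/5 = 1.14 each step
```

In this construction the projected backup multiplies $w$ by $6\gamma/5$ every iteration, so for any $\gamma > 5/6$ the fitted iterates diverge even though $\mathcal{B}$ and $\Pi$ are each individually well behaved.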