Learning theory

Tabular value function

Does tabular value iteration converge? And if so, to what? The value iteration update is

$$V(s)\leftarrow \max_a\big[\, r(s,a)+\gamma\,\mathbb{E}[V(s')]\,\big]$$

First, some notation:

• $r_a$: the vector of rewards $r(s,a)$ stacked over all states, for a fixed action $a$

• $\mathcal{T}_a$: the transition matrix for action $a$, with entries $\mathcal{T}_{a,i,j}=p(s'=i\mid s=j,a)$

Now define the Bellman backup operator $\mathcal{B}$ by $\mathcal{B}V=\max_a\left(r_a+\gamma\mathcal{T}_a V\right)$.

If $V^\star$ is a fixed point of $\mathcal{B}$, i.e. $\mathcal{B}V^\star=V^\star$, then $V^\star(s)=\max_a\left[r(s,a)+\gamma\,\mathbb{E}[V^\star(s')]\right]$, which is exactly the Bellman optimality equation.

The good news is that in the tabular case $V^\star$ always exists, is always unique, and always corresponds to the optimal policy. Value iteration reaches $V^\star$ because $\mathcal{B}$ is a contraction: for any $V$ and $\bar{V}$, $\|\mathcal{B}V-\mathcal{B}\bar{V}\|_{\infty}\le \gamma\|V-\bar{V}\|_\infty$, i.e. one backup shrinks the gap between $V$ and $\bar{V}$ by at least a factor of $\gamma$ in the $\infty$-norm.

Choosing $\bar{V}=V^\star$ completes the proof: $\|\mathcal{B}V-V^\star\|_\infty=\|\mathcal{B}V-\mathcal{B}V^\star\|_\infty\le\gamma\|V-V^\star\|_\infty$, so every iteration brings $V$ closer to $V^\star$ and tabular value iteration converges.
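As a concrete illustration, below is a minimal NumPy sketch of tabular value iteration in the vectorized form $\mathcal{B}V=\max_a(r_a+\gamma\mathcal{T}_a V)$, run on a small randomly generated MDP (the MDP itself is made up purely for illustration). The printed $\infty$-norm gaps between successive iterates shrink geometrically, as the contraction argument predicts.

```python
import numpy as np

# A small random MDP, made up purely for illustration:
# T[a, i, j] = p(s' = i | s = j, a), r[a, j] = r(s = j, a).
rng = np.random.default_rng(0)
S, A, gamma = 10, 3, 0.9
T = rng.random((A, S, S))
T /= T.sum(axis=1, keepdims=True)      # make each T[a, :, j] a distribution over next states
r = rng.random((A, S))

def bellman_backup(V):
    """B V = max_a (r_a + gamma * T_a V), computed for all states at once."""
    expected_next = np.einsum('aij,i->aj', T, V)   # E[V(s') | s = j, a]
    return np.max(r + gamma * expected_next, axis=0)

V = np.zeros(S)
for k in range(1, 51):
    V_new = bellman_backup(V)
    gap = np.max(np.abs(V_new - V))    # infinity-norm distance between successive iterates
    V = V_new
    if k % 10 == 0:
        print(f"iteration {k:2d}: gap = {gap:.2e}")
```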

Non-tabular value function

In the tabular case, the update is simply $V\leftarrow \mathcal{B}V$. In the non-tabular case, the value function is restricted to some hypothesis class $\Omega$ (for example, the set of functions a given neural network can represent), so we need a second operator $\Pi$: $\Pi V=\arg\min_{V'\in \Omega}\frac{1}{2}\sum_s\|V'(s)-V(s)\|^2$, i.e. $\Pi$ is the projection onto $\Omega$ in the $\ell_2$ sense.

So the fitted value iteration algorithm is $V\leftarrow \Pi\mathcal{B}V$.

• $\mathcal{B}$ is a contraction w.r.t. the $\infty$-norm (the "max" norm),

• $\Pi$ is a contraction w.r.t. the $\ell_2$-norm (Euclidean distance),

• but the composition $\Pi\mathcal{B}$ is not a contraction in any norm.

So value iteration converges in the tabular case, but fitted value iteration is not guaranteed to converge.
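To make the two operators concrete, here is a sketch of fitted value iteration $V\leftarrow\Pi\mathcal{B}V$ on the same kind of toy MDP, with $\Pi$ implemented as a least-squares fit onto a fixed, randomly chosen linear feature matrix (all of it illustrative, not from the text). The loop has the same shape as the tabular one above, but with the projection step in the middle there is no contraction argument, so the gap between iterates is not guaranteed to shrink.

```python
import numpy as np

# Same kind of toy MDP as above, plus a random linear feature matrix Phi (illustrative only).
rng = np.random.default_rng(1)
S, A, d, gamma = 10, 3, 4, 0.9
T = rng.random((A, S, S))
T /= T.sum(axis=1, keepdims=True)
r = rng.random((A, S))
Phi = rng.random((S, d))               # Omega = {Phi @ w : w in R^d}, the representable value functions

def bellman_backup(V):
    return np.max(r + gamma * np.einsum('aij,i->aj', T, V), axis=0)

def project(V_target):
    """Pi V: l2 projection onto Omega, i.e. a least-squares fit of Phi @ w to V_target."""
    w, *_ = np.linalg.lstsq(Phi, V_target, rcond=None)
    return Phi @ w

V = np.zeros(S)
for k in range(1, 51):
    V_new = project(bellman_backup(V))     # fitted value iteration: V <- Pi B V
    gap = np.max(np.abs(V_new - V))        # no guarantee this shrinks geometrically
    V = V_new
    if k % 10 == 0:
        print(f"iteration {k:2d}: gap = {gap:.2e}")
```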

Fitted Q-iteration

The same argument applies to fitted Q-iteration: the Bellman backup is a contraction in the $\infty$-norm and the regression (projection) step is a contraction in the $\ell_2$-norm, but their composition is not a contraction, so fitted Q-iteration is not guaranteed to converge. Fitted bootstrapped policy evaluation is likewise not guaranteed to converge.
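For completeness, here is a minimal sketch of the fitted Q-iteration loop in the same style, again with made-up linear features (here over state-action pairs): the regression target is $y=r(s,a)+\gamma\,\mathbb{E}[\max_{a'}Q(s',a')]$, and the fit step is exactly the projection that breaks the contraction argument.

```python
import numpy as np

# Toy MDP and random state-action features, purely for illustration.
rng = np.random.default_rng(2)
S, A, d, gamma = 10, 3, 4, 0.9
T = rng.random((A, S, S))
T /= T.sum(axis=1, keepdims=True)
r = rng.random((A, S))
Phi = rng.random((A, S, d))            # one feature vector per (action, state) pair

def fit(Q_target):
    """Projection step: least-squares fit of the linear model to the target Q-values."""
    X = Phi.reshape(A * S, d)
    w, *_ = np.linalg.lstsq(X, Q_target.reshape(A * S), rcond=None)
    return (X @ w).reshape(A, S)

Q = np.zeros((A, S))
for k in range(1, 51):
    V = Q.max(axis=0)                                    # max_a' Q(s', a')
    target = r + gamma * np.einsum('aij,i->aj', T, V)    # y = r(s,a) + gamma E[max_a' Q(s',a')]
    Q = fit(target)                                      # Q <- Pi B Q
```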
