Value iteration


Tabular value iteration

Considering that $A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$ and $V^\pi(s_t)$ does not depend on the action, we have $\arg\max_{a_t}A^\pi(s_t,a_t)=\arg\max_{a_t}Q^\pi(s_t,a_t)$, so we can just use $Q^\pi$ instead of $A^\pi$, where

$$Q^\pi(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}\left[V^\pi(s')\right]$$

Moreover, the greedy policy can be recovered directly as $\arg\max_a Q(s,a)$, so we can skip representing the policy explicitly and compute values directly:

Value iteration algorithm:

repeat until convergence:

1. set $Q(s,a)\leftarrow r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}\left[V(s')\right]$
2. set $V(s)\leftarrow \max_a Q(s,a)$
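
As a concrete illustration, here is a minimal NumPy sketch of tabular value iteration for a small MDP with known dynamics. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for rewards) and the convergence tolerance are assumptions made for this example, not something fixed by the lecture.

```python
import numpy as np

# A minimal sketch of tabular value iteration for a known MDP.
# Assumed array layout (illustrative):
#   P[s, a, s_next] : transition probabilities p(s' | s, a)
#   R[s, a]         : rewards r(s, a)
def value_iteration(P, R, gamma=0.99, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:  # repeat until convergence
        # step 1: Q(s, a) <- r(s, a) + gamma * E_{s' ~ p(s'|s,a)}[V(s')]
        Q = R + gamma * (P @ V)          # shape (n_states, n_actions)
        # step 2: V(s) <- max_a Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # the greedy policy is just argmax_a Q(s, a)
            return V_new, Q.argmax(axis=1)
        V = V_new
```

The greedy policy is recovered at the end with `Q.argmax(axis=1)`, matching the remark above that the policy is simply $\arg\max_a Q(s,a)$.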

Fitted value iteration

The question is how to represent $V(s)$. For small problems we can use a big table with one entry per discrete state $s$, but this does not scale to real-world problems, especially with image inputs, due to the curse of dimensionality. In that case a neural network $V_\phi: \mathcal{S}\to\mathbb{R}$ can represent the value function, trained with the loss

$$\mathcal{L}(\phi)=\frac{1}{2}\left\|V_\phi(s)-\max_{a}Q^\pi(s,a)\right\|^2$$

Fitted value iteration algorithm:

repeat until convergence:

1. set $y_i\leftarrow \max_{a_i}\left(r(s_i,a_i)+\gamma\,\mathbb{E}\left[V_\phi(s'_i)\right]\right)$
2. set $\phi\leftarrow \arg\min_\phi\frac{1}{2}\sum_i\left\|V_\phi(s_i)-y_i\right\|^2$
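
For illustration, below is a minimal PyTorch sketch of one fitted value iteration step. It assumes a deterministic, known model so that the expectation $\mathbb{E}[V_\phi(s'_i)]$ reduces to evaluating $V_\phi$ at a single successor state per action; the network architecture and all variable names are assumptions for this sketch only.

```python
import torch
import torch.nn as nn

# Illustrative value network for a 4-dimensional state space.
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def fitted_vi_step(states, rewards, next_states):
    """One fitted value iteration update (sketch, deterministic known model).

    states:      (N, state_dim)            sampled states s_i
    rewards:     (N, n_actions)            rewards r(s_i, a) for every action
    next_states: (N, n_actions, state_dim) successor states s'_i for every action
    """
    with torch.no_grad():
        # step 1: y_i <- max_a ( r(s_i, a) + gamma * V_phi(s'_i) )
        next_v = value_net(next_states).squeeze(-1)        # (N, n_actions)
        targets = (rewards + gamma * next_v).max(dim=1).values
    # step 2: phi <- argmin_phi 1/2 sum_i || V_phi(s_i) - y_i ||^2
    loss = 0.5 * ((value_net(states).squeeze(-1) - targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that computing the max over actions here requires querying the model for every action's outcome, which is the limitation addressed by fitting a Q-function instead (see Q iteration).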
