Policy iteration

Last chapter we discussed actor-critic algorithms, but what if we just use the critic (value function) without an actor (policy)? We certainly can, by extracting a policy from a value function.

Firstly, $A^\pi(s_t,a_t)$ measures how much better $a_t$ is than the average action according to $\pi$, and $\arg\max_{a_t}A^\pi(s_t,a_t)$ is the best action from $s_t$ if we then follow $\pi$, which can be viewed as a substitute for the policy $\pi(a_t|s_t)$. Notice that the "policy" obtained by this max trick is at least as good as any other policy, regardless of what $\pi(a_t|s_t)$ is.

So forget explicit policies; let's just use the max trick:

$$
\pi'(a_t|s_t)=
\begin{cases}
1 & \text{if } a_t=\arg\max_{a_t}A^\pi(s_t,a_t)\\
0 & \text{otherwise}
\end{cases}
$$
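
For intuition, here is a minimal sketch of this extraction step in the tabular case, assuming the advantage values are already stored in an array `A` indexed as `A[s, a]` (the random table and names below are just illustrative placeholders):

```python
import numpy as np

# Stand-in for a tabular advantage A^pi(s, a): 16 states x 4 actions.
A = np.random.default_rng(0).random((16, 4))

# arg max_a A^pi(s, a) for every state s.
greedy_action = A.argmax(axis=1)

# Deterministic pi'(a|s): probability 1 on the argmax action, 0 elsewhere.
pi_prime = np.zeros_like(A)
pi_prime[np.arange(A.shape[0]), greedy_action] = 1.0
```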

At a high level, we get the policy iteration algorithm:

Policy Iteration: repeat until convergence

1. evaluate $A^\pi(s_t,a_t)$
2. set $\pi \leftarrow \pi'$

As before, $A^\pi(s,a)=r(s,a)+\gamma\,\mathbb{E}[V^\pi(s')]-V^\pi(s)$. So now the key problem is to evaluate $V^\pi(s)$.

Dynamic programming

First of all, some basic assumptions: we know the dynamics $p(s'|s,a)$, and both $s$ and $a$ are discrete (and small). For example:

16 states, 4 actions per state. We can store the full $V^\pi(s)$ in a table, and the transition operator $\mathcal{T}$ is a $16\times16\times4$ tensor.
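
As a concrete sketch of this tabular setup (the random dynamics and rewards are placeholders, and indexing the tensor as `T[s', s, a]` is one possible convention for the $16\times16\times4$ shape):

```python
import numpy as np

n_states, n_actions = 16, 4
rng = np.random.default_rng(0)

# Transition tensor T[s', s, a] = p(s' | s, a), shape 16 x 16 x 4.
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)   # each T[:, s, a] is a valid distribution over s'

r = rng.random((n_states, n_actions))   # reward table r(s, a)
V = np.zeros(n_states)                  # V^pi(s) stored as a plain table
```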

We can use the bootstrapped update:

$$
V^\pi(s)\leftarrow \mathbb{E}_{a\sim\pi(a|s)}\left[r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}[V^\pi(s')]\right]
$$

Since we use a deterministic policy $\pi(s)=a$, the update can be simplified even further:

$$
V^\pi(s)\leftarrow r(s,\pi(s))+\gamma\,\mathbb{E}_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]
$$
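
A rough sketch of one such bootstrapped evaluation sweep for a deterministic tabular policy, assuming the same illustrative `T`, `r`, `V` tables as in the setup sketch above (recreated here so the snippet runs on its own):

```python
import numpy as np

n_states, n_actions, gamma = 16, 4, 0.99
rng = np.random.default_rng(0)
T = rng.random((n_states, n_states, n_actions))   # T[s', s, a] = p(s' | s, a)
T /= T.sum(axis=0, keepdims=True)
r = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)       # deterministic policy: pi[s] = a
V = np.zeros(n_states)

def evaluation_sweep(V, pi):
    """One application of V(s) <- r(s, pi(s)) + gamma * E_{s'}[V(s')]."""
    expected_next_V = np.einsum('i,ijk->jk', V, T)   # E[V(s') | s, a] for all (s, a)
    s = np.arange(n_states)
    return r[s, pi] + gamma * expected_next_V[s, pi]

# Repeat until (approximately) converged.
for _ in range(200):
    V = evaluation_sweep(V, pi)
```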

Simplified Policy Iteration: repeat until convergence

1. evaluate $V^\pi(s)$
2. set $\pi \leftarrow \pi'$
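
Putting the two steps together, a minimal end-to-end sketch of this simplified policy iteration in the tabular setting (same illustrative `T`, `r` placeholders as above; the greedy step takes $\arg\max_a [r(s,a)+\gamma\,\mathbb{E}[V^\pi(s')]]$, which selects the same action as $\arg\max_a A^\pi(s_t,a_t)$ because the baseline $V^\pi(s)$ does not depend on $a$):

```python
import numpy as np

n_states, n_actions, gamma = 16, 4, 0.99
rng = np.random.default_rng(0)
T = rng.random((n_states, n_states, n_actions))   # T[s', s, a] = p(s' | s, a)
T /= T.sum(axis=0, keepdims=True)
r = rng.random((n_states, n_actions))

V = np.zeros(n_states)
pi = np.zeros(n_states, dtype=int)                # deterministic policy: pi[s] = a
s = np.arange(n_states)

for _ in range(100):                              # repeat until convergence
    # 1: evaluate V^pi(s) with a few bootstrapped sweeps.
    for _ in range(50):
        expected_next_V = np.einsum('i,ijk->jk', V, T)    # E[V(s') | s, a]
        V = r[s, pi] + gamma * expected_next_V[s, pi]
    # 2: greedy improvement, pi <- pi'.
    expected_next_V = np.einsum('i,ijk->jk', V, T)
    pi = np.argmax(r + gamma * expected_next_V, axis=1)
```

The number of inner evaluation sweeps is a design choice: running it to convergence gives classic policy iteration, while collapsing it toward a single sweep moves in the direction of value iteration, which the next chapter covers.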
