Correlated samples and unstable target

The online Q-iteration algorithm has two apparent problems. First, the samples, which are observed by rolling out some policy, are strongly correlated; why this is a problem is discussed below. Second, the target value changes at every step.

Take the sine function as an example: if we observe the curve sequentially along the x-axis and fit it online, we never recover the whole curve, because consecutive points are strongly correlated and the regression target keeps shifting. A toy experiment illustrating this is sketched below.
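As a rough illustration (not from the original notes), the toy experiment below fits a small network to $\sin(x)$ twice: once visiting the points in order along the x-axis (strongly correlated updates) and once in a shuffled, roughly i.i.d. order. The architecture, learning rate, and number of passes are arbitrary choices; the point is only that the ordered sweep tends to chase the most recently seen part of the curve.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# Toy regression target: y = sin(x) on [0, 2*pi].
xs = np.linspace(0, 2 * np.pi, 1000, dtype=np.float32)
ys = np.sin(xs)

def fit(order):
    """Fit a small MLP with pointwise SGD, visiting the points in `order`."""
    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-2)
    for i in order:
        x = torch.tensor([[xs[i]]], dtype=torch.float32)
        y = torch.tensor([[ys[i]]], dtype=torch.float32)
        loss = (net(x) - y).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Evaluate on the whole curve.
    with torch.no_grad():
        pred = net(torch.from_numpy(xs).unsqueeze(1)).squeeze(1).numpy()
    return float(np.mean((pred - ys) ** 2))

epochs = 5
sequential = np.tile(np.arange(len(xs)), epochs)  # sweep along the x-axis, in order
shuffled = np.random.permutation(sequential)      # same data, i.i.d.-like order

# The sequential run tends to track only the most recently visited part of the
# curve (it keeps chasing a moving local target), while the shuffled run
# typically fits the whole curve much better.
print("MSE, sequential order:", fit(sequential))
print("MSE, shuffled order:  ", fit(shuffled))
```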

Correlated samples

Solution 1.

We need the samples used for each update to be (approximately) independently and identically distributed. One way is to collect data with several workers in parallel, so that a batch mixes transitions from different trajectories (see the parallel sampling figure); a minimal sketch follows below.
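A minimal sketch of this idea, with a placeholder random-walk "environment" and a hypothetical `worker` function (both my own illustrative choices): several processes roll out independently and push transitions into a shared queue, so the learner's batches mix data from different trajectories.

```python
import multiprocessing as mp
import random

def worker(worker_id, queue, n_steps=500):
    """Each worker runs its own rollout and pushes transitions (s, a, s', r)
    into a shared queue. The "environment" here is just a 1-D random walk."""
    rng = random.Random(worker_id)
    s = 0.0
    for _ in range(n_steps):
        a = rng.choice([-1.0, 1.0])           # placeholder behavior policy
        s_next = s + a + rng.gauss(0.0, 0.1)  # placeholder dynamics
        r = -abs(s_next)                      # placeholder reward
        queue.put((s, a, s_next, r))
        s = s_next

if __name__ == "__main__":
    n_workers, n_steps = 4, 500
    queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(i, queue, n_steps))
               for i in range(n_workers)]
    for w in workers:
        w.start()

    # The learner drains transitions coming from several independent rollouts,
    # so any mini-batch built from them mixes data across trajectories and is
    # far less correlated than a single sequential rollout.
    transitions = [queue.get() for _ in range(n_workers * n_steps)]
    for w in workers:
        w.join()

    print(f"collected {len(transitions)} transitions "
          f"from {n_workers} parallel workers")
```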

Solution 2.

Since Q-learning is off-policy, samples collected by any policy can be used. We can therefore store transitions in a buffer and sample mini-batches from it, which makes the samples approximately i.i.d. This structure is called a replay buffer.

Full Q-learning with replay buffer:

repeat until convergence:
    1. Collect a dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy and add it to the buffer $\mathcal{B}$.
    repeat $K$ times ($K = 1$ is common, though a larger $K$ is more efficient):
        2. Sample a batch $(s_i, a_i, s'_i, r_i)$ from $\mathcal{B}$.
        3. $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi(s_i,a_i)}{d\phi}\left(Q_\phi(s_i,a_i) - \left[r(s_i,a_i) + \gamma \max_{a'} Q_\phi(s'_i,a')\right]\right)$

+ : the samples are no longer correlated

+ : multiple samples in the batch give a low-variance gradient
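Below is a minimal Python sketch of this algorithm's inner loop, assuming a discrete-action Q-network `q_net` that maps a batch of states to per-action Q-values. The names (`ReplayBuffer`, `q_update`), the mean-squared-error form of the regression, and the `done` flag for terminal states are illustrative choices, not something specified in the notes.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Fixed-size FIFO buffer B of transitions (s, a, s', r, done)."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, s_next, r, done):
        self.storage.append((s, a, s_next, r, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.storage), batch_size)
        s, a, s_next, r, done = map(np.array, zip(*batch))
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))

def q_update(q_net, optimizer, buffer, batch_size=64, gamma=0.99):
    """One gradient step of step 3: regress Q_phi(s_i, a_i) toward
    r(s_i, a_i) + gamma * max_a' Q_phi(s'_i, a')."""
    s, a, s_next, r, done = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s_i, a_i)
    with torch.no_grad():
        # The target here still uses the same parameters phi,
        # so it moves after every gradient step (the "unstable target" problem).
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```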

Unstable target

In the online Q-learning algorithm, the target changes at every gradient step, which makes the procedure much less stable than a standard regression problem in supervised learning.

Q-learning with replay buffer and target network:

repeat until convergence:
    1. Save the target network parameters: $\phi' \leftarrow \phi$.
    repeat $N$ times:
        2. Collect a dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy and add it to $\mathcal{B}$.
        repeat $K$ times:
            3. Sample a batch $(s_i, a_i, s'_i, r_i)$ from $\mathcal{B}$.
            4. $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi(s_i,a_i)}{d\phi}\left(Q_\phi(s_i,a_i) - \left[r(s_i,a_i) + \gamma \max_{a'} Q_{\phi'}(s'_i,a')\right]\right)$

Notice that in step 4 the target value now uses $\phi'$ instead of $\phi$.

The purpose is to keep the target fixed inside the inner loops, so each update looks more like a standard regression step and is therefore more stable. The price is that the targets are computed from older parameters, so new information propagates more slowly and training takes longer than before.
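Continuing the sketch above (and reusing the hypothetical `ReplayBuffer`), the only change in step 4 is that the max in the target is computed with the frozen parameters $\phi'$, while step 1 periodically copies $\phi$ into $\phi'$. The function names and the `done` handling are again illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def q_update_with_target(q_net, target_net, optimizer, buffer,
                         batch_size=64, gamma=0.99):
    """One gradient step of step 4: the max in the target uses phi'
    (target_net), not the current parameters phi (q_net)."""
    s, a, s_next, r, done = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s_i, a_i)
    with torch.no_grad():
        # Target computed with the frozen parameters phi'.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Outer loop skeleton mirroring the pseudocode:
# target_net = copy.deepcopy(q_net)                    # initialize phi' once
# while not converged:
#     target_net.load_state_dict(q_net.state_dict())   # step 1: phi' <- phi
#     for _ in range(N):                               # step 2: collect data into buffer
#         ...
#         for _ in range(K):                           # steps 3-4
#             q_update_with_target(q_net, target_net, optimizer, buffer)
```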

[Figures: sine function example; parallel sampling; replay buffer]