RL Objective

The goal of reinforcement learning

Trajectory Probability

$$p_\theta(s_1,a_1,\cdots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

We usually denote the sequence $s_1,a_1,\cdots,s_T,a_T$ as $\tau$. The left-hand side is the trajectory probability; the right-hand side is its Markov chain factorization, i.e., the initial state distribution followed by alternating policy and transition terms.
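
As a quick sketch of this factorization, we can compute the log-probability of a trajectory in a small tabular MDP by summing the initial-state, policy, and transition log-terms. The arrays `p0`, `P`, `policy` and their sizes below are made up for illustration:

```python
import numpy as np

# Hypothetical tabular MDP (sizes and names are assumptions for illustration).
S, A = 3, 2
rng = np.random.default_rng(0)
p0 = np.full(S, 1.0 / S)                    # initial state distribution p(s_1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s'] = p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # policy[s, a] = pi_theta(a | s)

def trajectory_log_prob(states, actions):
    """log p_theta(tau) = log p(s_1) + sum_t [log pi(a_t|s_t) + log p(s_{t+1}|s_t,a_t)]."""
    logp = np.log(p0[states[0]])
    for t in range(len(actions)):
        logp += np.log(policy[states[t], actions[t]])
        if t + 1 < len(states):             # last step has no observed next state
            logp += np.log(P[states[t], actions[t], states[t + 1]])
    return logp

# Example trajectory tau = (s_1, a_1, ..., s_T, a_T) with T = 4
print(trajectory_log_prob(states=[0, 2, 1, 0], actions=[1, 0, 0, 1]))
```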

Our goal is to find the optimal parameters $\theta$ (denoted $\theta^\star$) that maximize the expected total reward:

$$\theta^\star=\arg\max_{\theta}\; E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$
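
This expectation can be estimated by Monte Carlo: roll out trajectories under $\pi_\theta$ and average the total reward. A minimal sketch, assuming the same kind of made-up tabular MDP plus a reward table `R`:

```python
import numpy as np

# Hypothetical tabular MDP with a reward table (all names are assumptions).
S, A, T = 3, 2, 5
rng = np.random.default_rng(0)
p0 = np.full(S, 1.0 / S)                    # p(s_1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)
R = rng.normal(size=(S, A))                 # r(s, a)

def estimate_objective(num_trajectories=10_000):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta(tau)}[ sum_t r(s_t, a_t) ]."""
    total = 0.0
    for _ in range(num_trajectories):
        s = rng.choice(S, p=p0)             # s_1 ~ p(s_1)
        for _ in range(T):
            a = rng.choice(A, p=policy[s])  # a_t ~ pi_theta(a_t | s_t)
            total += R[s, a]
            s = rng.choice(S, p=P[s, a])    # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return total / num_trajectories

print(estimate_objective())                 # approximates the expected total reward
```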

Two cases

The state-action pairs $(s_t,a_t)$ themselves form a Markov chain, with transition probability

$$p(s_{t+1},a_{t+1}|s_t,a_t)=p(s_{t+1}|s_t,a_t)\,\pi_\theta(a_{t+1}|s_{t+1})$$
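
In the tabular case this chain has an explicit $SA \times SA$ transition matrix; the sketch below (names such as `T_sa` are illustrative) builds it and checks that each row is a distribution over the next state-action pair:

```python
import numpy as np

# Hypothetical tabular setup (names and sizes are assumptions for illustration).
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)

# T_sa[(s, a), (s', a')] = p(s' | s, a) * pi_theta(a' | s')
T_sa = np.einsum('sap,pb->sapb', P, policy).reshape(S * A, S * A)

# Each row sums to 1, so (s_t, a_t) is a Markov chain with this transition matrix.
assert np.allclose(T_sa.sum(axis=1), 1.0)
```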

Finite horizon case: state-action marginal

$$\theta^\star=\arg\max_\theta \sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}\left[r(s_t,a_t)\right]$$

where $p_\theta(s_t,a_t)$ is called the state-action marginal.
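
In the tabular case the state-action marginals, and therefore the finite-horizon objective, can be computed exactly by propagating the state distribution forward one step at a time. A sketch with made-up arrays:

```python
import numpy as np

# Hypothetical tabular MDP (names and sizes are assumptions for illustration).
S, A, T = 3, 2, 5
rng = np.random.default_rng(0)
p0 = np.full(S, 1.0 / S)                    # p(s_1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)
R = rng.normal(size=(S, A))                 # r(s, a)

J = 0.0
p_s = p0.copy()                             # p_theta(s_t), starting from t = 1
for t in range(T):
    p_sa = p_s[:, None] * policy            # state-action marginal p_theta(s_t, a_t)
    J += np.sum(p_sa * R)                   # E_{(s_t, a_t) ~ p_theta(s_t, a_t)}[ r(s_t, a_t) ]
    p_s = np.einsum('sa,sap->p', p_sa, P)   # next state marginal p_theta(s_{t+1})

print(J)                                    # exact finite-horizon objective for this MDP
```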

Infinite horizon case: stationary distribution

$$\theta^\star =\arg\max_{\theta}\frac{1}{T}\sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \;\to\; E_{(s,a)\sim p_\theta(s,a)}[r(s,a)] \quad (T\to\infty)$$

where $\mu=p_\theta(s,a)$ is called the stationary distribution.

Thus $\mu =\mathcal{T}\mu$, where $\mathcal{T}$ is the state-action transition operator; "stationary" means the distribution is the same before and after a transition.

In other words, $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1, and such a $\mu$ always exists under some regularity conditions.
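
A sketch of this in the tabular case: build the state-action transition matrix and apply it repeatedly (power iteration) until the distribution stops changing; the result is the eigenvector with eigenvalue 1. The setup and names below are illustrative:

```python
import numpy as np

# Hypothetical tabular setup (names and sizes are assumptions for illustration).
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)

# State-action transition operator: T_sa[(s, a), (s', a')] = p(s' | s, a) pi_theta(a' | s')
T_sa = np.einsum('sap,pb->sapb', P, policy).reshape(S * A, S * A)

# With a row-stochastic matrix, mu = T mu from the notes reads mu = T_sa^T mu.
# Power iteration: keep applying the transition until the distribution stops changing.
mu = np.full(S * A, 1.0 / (S * A))
for _ in range(1_000):
    mu = T_sa.T @ mu

assert np.allclose(mu, T_sa.T @ mu)         # mu is unchanged by a transition (eigenvalue 1)
print(mu.reshape(S, A))                     # stationary distribution p_theta(s, a)
```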
