MDP Definition

Terminology and Notation

Let's begin with the Markov Decision Process (MDP).

Definitions

Fully Observed

In the fully observed (MDP) setting, the agent sees the state $s_t$ directly, chooses action $a_t$, and the next state is drawn from $p(s_{t+1} \mid s_t, a_t)$.

Partially Observed

In the partially observed setting, we choose action $a_t$ at time $t$ after seeing observation $o_t$; the latent state $s_t$ then transitions to $s_{t+1}$, and the environment returns reward $r(s_t, a_t)$.

Markov decision process: $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{T},r\}$

$\mathcal{S}$ : state space; states $s\in\mathcal{S}$ (discrete or continuous)

$\mathcal{A}$ : action space; actions $a\in\mathcal{A}$ (discrete or continuous)

$\mathcal{T}$ : transition operator; for discrete spaces it is a tensor with entries $\mathcal{T}_{i,j,k}=p(s_{t+1}=i \mid s_t=j, a_t=k)$

$r$ : reward function; $r(s_t,a_t):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$
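
As a minimal sketch (not from the course notes), and assuming a small discrete state and action space, the transition operator $\mathcal{T}$ and the reward function of an MDP could be represented with NumPy arrays like this:

```python
# Hypothetical tabular MDP M = {S, A, T, r} with discrete states and actions.
# T[s', s, a] = p(s_{t+1} = s' | s_t = s, a_t = a); r[s, a] = reward for (s_t, a_t).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Transition tensor: normalize so each (s, a) slice is a probability distribution over s'.
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)

# Reward function r(s_t, a_t): S x A -> R, stored as a lookup table.
r = rng.standard_normal((n_states, n_actions))

def step(s, a):
    """Sample s_{t+1} ~ p(. | s_t = s, a_t = a) and return it with the reward r(s, a)."""
    s_next = rng.choice(n_states, p=T[:, s, a])
    return s_next, r[s, a]

# One environment step from state 0 with action 1.
s_next, reward = step(0, 1)
```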

Partially observed Markov decision process: $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{E},r\}$

$\mathcal{S}$ : state space; states $s\in\mathcal{S}$ (discrete or continuous)

$\mathcal{A}$ : action space; actions $a\in\mathcal{A}$ (discrete or continuous)

$\mathcal{O}$ : observation space; observations $o\in\mathcal{O}$ (discrete or continuous)

$\mathcal{T}$ : transition operator, a tensor

$\mathcal{E}$ : emission probability $p(o_t\mid s_t)$

$r$ : reward function; $r(s_t,a_t):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$
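
The POMDP extends the sketch above with an observation space and emission probabilities; the following is again a hypothetical illustration, assuming discrete states, actions, and observations, where the agent only ever receives $o_t$, not the latent state $s_t$:

```python
# Hypothetical tabular POMDP M = {S, A, O, T, E, r}.
# E[o, s] = p(o_t = o | s_t = s) is the emission probability.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_obs = 4, 2, 3

T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)          # p(s' | s, a)

E = rng.random((n_obs, n_states))
E /= E.sum(axis=0, keepdims=True)          # p(o | s)

r = rng.standard_normal((n_states, n_actions))

def step(s, a):
    """Advance the latent state, then emit an observation; the agent sees only (o_next, reward)."""
    s_next = rng.choice(n_states, p=T[:, s, a])
    o_next = rng.choice(n_obs, p=E[:, s_next])
    return s_next, o_next, r[s, a]
```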

MDP Notation