RL Objective

The goal of reinforcement learning is to find a policy $\pi_\theta$ that maximizes the expected total reward over trajectories.

Trajectory Probability

$$p_\theta(s_1,a_1,\cdots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

We usually denote $s_1,a_1,\cdots,s_T,a_T$ as $\tau$. The left-hand side is the trajectory probability, and the right-hand side is its factorization as a Markov chain: the initial state distribution, the policy, and the transition dynamics.
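
As a sanity check, here is a minimal NumPy sketch that samples a trajectory and accumulates its log-probability according to this factorization. The tabular MDP (`p1`, `P`, `pi`) is a made-up 2-state, 2-action placeholder, not anything from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP: 2 states, 2 actions (placeholder numbers).
p1 = np.array([0.8, 0.2])                      # initial state distribution p(s1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3],                     # pi[s, a] = pi_theta(a | s)
               [0.4, 0.6]])

def sample_trajectory(T=5):
    """Sample tau = (s_1, a_1, ..., s_T, a_T) and its log-probability
    under p(s1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t)."""
    s = rng.choice(2, p=p1)
    logp = np.log(p1[s])
    tau = []
    for t in range(T):
        a = rng.choice(2, p=pi[s])
        logp += np.log(pi[s, a])
        tau.append((s, a))
        if t < T - 1:                          # no transition needed after the last action
            s_next = rng.choice(2, p=P[s, a])
            logp += np.log(P[s, a, s_next])
            s = s_next
    return tau, logp

tau, logp = sample_trajectory()
print(tau, logp)
```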

What we really need is to find an optimal $\theta$ (denoted $\theta^\star$):

$$\theta^\star=\arg\max_{\theta} E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$
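
This expectation can be estimated by Monte Carlo: sample trajectories under $\pi_\theta$ and average the total reward. A minimal sketch, again with a hypothetical tabular MDP and a made-up reward table `r`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical 2-state, 2-action tabular MDP (placeholder numbers).
p1 = np.array([0.8, 0.2])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
r = np.array([[1.0, 0.0],                      # r[s, a]: made-up reward table
              [0.0, 2.0]])

def estimate_objective(n_trajectories=1000, T=10):
    """Monte Carlo estimate of E_{tau ~ p_theta(tau)}[sum_t r(s_t, a_t)]."""
    returns = []
    for _ in range(n_trajectories):
        s = rng.choice(2, p=p1)
        total = 0.0
        for _ in range(T):
            a = rng.choice(2, p=pi[s])
            total += r[s, a]
            s = rng.choice(2, p=P[s, a])
        returns.append(total)
    return np.mean(returns)

print(estimate_objective())
```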

Two cases

The pair $(s_t,a_t)$ itself forms a Markov chain, with transition

$$p(s_{t+1},a_{t+1}|s_t,a_t)=p(s_{t+1}|s_t,a_t)\,\pi_\theta(a_{t+1}|s_{t+1})$$
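
For a tabular MDP this transition can be written as an $SA \times SA$ matrix acting on distributions over state-action pairs. A sketch, assuming the same hypothetical 2-state, 2-action placeholder as above:

```python
import numpy as np

# Hypothetical 2-state, 2-action tabular MDP (placeholder numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3],                    # pi[s, a] = pi_theta(a | s)
               [0.4, 0.6]])

S, A = pi.shape

# T_op[(s', a'), (s, a)] = p(s' | s, a) * pi_theta(a' | s'):
# the transition operator of the Markov chain over state-action pairs.
T_op = np.zeros((S * A, S * A))
for s in range(S):
    for a in range(A):
        for s_next in range(S):
            for a_next in range(A):
                T_op[s_next * A + a_next, s * A + a] = P[s, a, s_next] * pi[s_next, a_next]

# Each column sums to 1, so T_op maps distributions over (s, a) to distributions.
print(T_op.sum(axis=0))
```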

Finite horizon case: state-action marginal

$$\theta^\star=\arg\max_\theta \sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]$$

where $p_\theta(s_t,a_t)$ is called the state-action marginal.
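
In the tabular case the state-action marginals can be computed exactly by starting from $p(s_1)\pi_\theta(a_1|s_1)$ and repeatedly applying the transition operator over $(s,a)$ pairs. A sketch of the finite-horizon objective with the same placeholder MDP and made-up reward table:

```python
import numpy as np

# Hypothetical 2-state, 2-action tabular MDP (placeholder numbers).
p1 = np.array([0.8, 0.2])                     # p(s1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3],                    # pi[s, a] = pi_theta(a | s)
               [0.4, 0.6]])
r = np.array([[1.0, 0.0],                     # r[s, a]: made-up reward table
              [0.0, 2.0]])
S, A = pi.shape

# Transition operator over (s, a) pairs: T_op[(s',a'), (s,a)] = p(s'|s,a) pi(a'|s').
T_op = np.einsum('sap,pb->pbsa', P, pi).reshape(S * A, S * A)

# p_theta(s_1, a_1) = p(s_1) pi(a_1 | s_1), flattened to a vector over (s, a).
mu = (p1[:, None] * pi).reshape(S * A)

# Finite-horizon objective: sum_t E_{(s_t,a_t) ~ p_theta(s_t,a_t)}[r(s_t,a_t)].
T_horizon, J = 10, 0.0
for t in range(T_horizon):
    J += mu @ r.reshape(S * A)                # expected reward at step t
    mu = T_op @ mu                            # advance the state-action marginal
print(J)
```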

Infinite horizon case: stationary distribution

$$\theta^\star =\arg\max_{\theta}\frac{1}{T}\sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \;\to\; E_{(s,a)\sim p_\theta(s,a)}[r(s,a)] \quad \text{as } T\to\infty$$

where $\mu=p_\theta(s,a)$ is called the stationary distribution.

Thus $\mu =\mathcal{T}\mu$, where $\mathcal{T}$ is the state-action transition operator above; stationary means the distribution is the same before and after a transition.

In other words, $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1, and it exists under some regularity conditions (e.g., ergodicity of the chain).
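
A sketch that recovers $\mu$ numerically as the eigenvector of the tabular transition operator with eigenvalue 1, using the same placeholder MDP as above:

```python
import numpy as np

# Hypothetical 2-state, 2-action tabular MDP (placeholder numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3],                    # pi[s, a] = pi_theta(a | s)
               [0.4, 0.6]])
r = np.array([[1.0, 0.0],                     # r[s, a]: made-up reward table
              [0.0, 2.0]])
S, A = pi.shape

# Transition operator over (s, a) pairs.
T_op = np.einsum('sap,pb->pbsa', P, pi).reshape(S * A, S * A)

# Stationary distribution: eigenvector of T_op with eigenvalue 1, i.e. mu = T mu.
eigvals, eigvecs = np.linalg.eig(T_op)
idx = np.argmin(np.abs(eigvals - 1.0))
mu = np.real(eigvecs[:, idx])
mu = mu / mu.sum()                            # normalize to a probability distribution

print("mu =", mu)
print("mu == T mu:", np.allclose(mu, T_op @ mu))
print("average reward E_{(s,a)~mu}[r(s,a)] =", mu @ r.reshape(S * A))
```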
