The goal of reinforcement learning
Trajectory Probability
$$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

We usually denote $s_1, a_1, \ldots, s_T, a_T$ as $\tau$. The left-hand side is the trajectory probability; the right-hand side is its factorization as a Markov chain: the initial-state distribution times the per-step policy and transition probabilities.
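As a concrete illustration, a minimal sketch of evaluating $p_\theta(\tau)$ for a tabular MDP, assuming hypothetical arrays `p1` (initial-state distribution), `pi[s, a]` (policy), and `P[s, a, s']` (dynamics):

```python
import numpy as np

def trajectory_log_prob(tau, p1, pi, P):
    """Log-probability of a trajectory tau = [(s_1, a_1), ..., (s_T, a_T)] under
    p_theta(tau) = p(s_1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t).

    Assumed (hypothetical) tabular-MDP arrays:
      p1: (S,)      initial-state distribution
      pi: (S, A)    pi[s, a] = pi_theta(a | s)
      P:  (S, A, S) P[s, a, s'] = p(s' | s, a)
    """
    s0, _ = tau[0]
    logp = np.log(p1[s0])                        # p(s_1)
    for t, (s, a) in enumerate(tau):
        logp += np.log(pi[s, a])                 # pi_theta(a_t | s_t)
        if t + 1 < len(tau):                     # tau ends at a_T, so the last transition is omitted
            s_next, _ = tau[t + 1]
            logp += np.log(P[s, a, s_next])      # p(s_{t+1} | s_t, a_t)
    return logp
```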
What we really need is to find an optimal $\theta$ (denoted $\theta^\star$):

$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
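This expectation can be estimated by Monte Carlo rollouts. A rough sketch under the same tabular assumptions, with an additional hypothetical reward table `R[s, a]`:

```python
import numpy as np

def estimate_return(p1, pi, P, R, T, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of E_{tau ~ p_theta(tau)}[ sum_t r(s_t, a_t) ].

    p1, pi, P, R are assumed tabular-MDP arrays (hypothetical names)."""
    rng = np.random.default_rng(seed)
    S, A = pi.shape
    total = 0.0
    for _ in range(n_rollouts):
        s = rng.choice(S, p=p1)                  # s_1 ~ p(s_1)
        ret = 0.0
        for _ in range(T):
            a = rng.choice(A, p=pi[s])           # a_t ~ pi_theta(. | s_t)
            ret += R[s, a]                       # accumulate r(s_t, a_t)
            s = rng.choice(S, p=P[s, a])         # s_{t+1} ~ p(. | s_t, a_t)
        total += ret
    return total / n_rollouts
```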
Two cases

The pairs $(s_t, a_t)$ themselves form a Markov chain, with transition

$$p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_{t+1} \mid s_{t+1})$$
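In matrix form, this factorization gives the transition matrix of the chain over $(s, a)$ pairs; a sketch with the same assumed `pi` and `P` tables:

```python
import numpy as np

def state_action_transition(pi, P):
    """T[(s,a), (s',a')] = p(s' | s, a) * pi_theta(a' | s'),
    the transition matrix of the Markov chain on (s, a) pairs.

    pi: (S, A) policy table, P: (S, A, S) dynamics (assumed names)."""
    S, A = pi.shape
    T = np.einsum('sap,pb->sapb', P, pi)         # T[s, a, s', a'] = P[s, a, s'] * pi[s', a']
    T = T.reshape(S * A, S * A)                  # rows: current (s, a), cols: next (s', a')
    assert np.allclose(T.sum(axis=1), 1.0)       # each row is a distribution over the next pair
    return T
```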
Finite horizon case: state-action marginal

$$\theta^\star = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$

where $p_\theta(s_t, a_t)$ is called the state-action marginal.
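One way to read the finite-horizon form: push the state-action marginal forward one step at a time and accumulate the expected reward. A sketch under the same tabular assumptions (`p1`, `pi`, `P`, `R` are hypothetical names):

```python
import numpy as np

def finite_horizon_objective(p1, pi, P, R, T):
    """Evaluate sum_{t=1}^T E_{(s_t, a_t) ~ p_theta(s_t, a_t)}[ r(s_t, a_t) ]
    by propagating the state marginal through the policy and dynamics."""
    ps = p1.copy()                               # p_theta(s_1)
    total = 0.0
    for _ in range(T):
        psa = ps[:, None] * pi                   # p_theta(s_t, a_t) = p_theta(s_t) pi_theta(a_t | s_t)
        total += np.sum(psa * R)                 # E_{(s_t, a_t)}[ r(s_t, a_t) ]
        ps = np.einsum('sa,sap->p', psa, P)      # p_theta(s_{t+1}) = sum_{s,a} p_theta(s_t, a_t) p(s' | s, a)
    return total
```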
Infinite horizon case: stationary distribution
$$\theta^\star = \arg\max_\theta \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right] \;\to\; \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\left[r(s, a)\right] \quad \text{as } T \to \infty$$

where $\mu = p_\theta(s, a)$ is called the stationary distribution.
Thus $\mu = \mathcal{T}\mu$, where $\mathcal{T}$ is the transition operator of the $(s, a)$ Markov chain; stationary means the distribution is the same before and after a transition. Equivalently, $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1, and it always exists under some regularity conditions (e.g., ergodicity of the chain).
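A minimal numerical sketch: with the row-stochastic $(s, a)$ transition matrix `T` from the earlier sketch, applying the transition operator to a distribution is `mu @ T`, and power iteration finds the eigenvalue-1 eigenvector (assuming the chain is ergodic so the iteration converges):

```python
import numpy as np

def stationary_distribution(T, n_iters=10_000, tol=1e-10):
    """Find mu with mu = mu @ T (T row-stochastic over (s, a) pairs),
    i.e. the eigenvector with eigenvalue 1, by power iteration."""
    n = T.shape[0]
    mu = np.full(n, 1.0 / n)                     # start from the uniform distribution
    for _ in range(n_iters):
        mu_next = mu @ T                         # one step of the chain
        if np.max(np.abs(mu_next - mu)) < tol:   # stationary: same before and after a transition
            return mu_next
        mu = mu_next
    return mu
```

With $\mu$ in hand (e.g., `mu = stationary_distribution(state_action_transition(pi, P))`), the infinite-horizon objective above is approximately `np.sum(mu.reshape(S, A) * R)` for the assumed reward table `R`.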