RL Objective
Last updated
Last updated
We usually denote as . The left part is trajectory probability and the right part is Markov Chain.
What we really need is to find an optimal (denoted as )
where is called state-action marginal.
where is called stationary distribution.
Thus , the meaning is that stationary = the same before and after transition
is eigenvector of with eigenvalue 1, and it always exists under some regularity conditions.