RL Objective

The goal of reinforcement learning

Trajectory Probability

$$p_\theta(s_1,a_1,\cdots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

We usually denote the sequence $s_1,a_1,\cdots,s_T,a_T$ as $\tau$. The left-hand side is the trajectory probability; the right-hand side is its Markov chain factorization, i.e., the initial state distribution followed by alternating policy and transition terms.
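
As a quick sketch of this factorization, we can compute the log-probability of a trajectory in a small tabular MDP by summing the initial-state, policy, and transition log-terms. The arrays `p0`, `P`, `policy` and their sizes below are made up for illustration:

```python
import numpy as np

# Hypothetical tabular MDP (sizes and names are assumptions for illustration).
S, A = 3, 2
rng = np.random.default_rng(0)
p0 = np.full(S, 1.0 / S)                    # initial state distribution p(s_1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s'] = p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # policy[s, a] = pi_theta(a | s)

def trajectory_log_prob(states, actions):
    """log p_theta(tau) = log p(s_1) + sum_t [log pi(a_t|s_t) + log p(s_{t+1}|s_t,a_t)]."""
    logp = np.log(p0[states[0]])
    for t in range(len(actions)):
        logp += np.log(policy[states[t], actions[t]])
        if t + 1 < len(states):             # last step has no observed next state
            logp += np.log(P[states[t], actions[t], states[t + 1]])
    return logp

# Example trajectory tau = (s_1, a_1, ..., s_T, a_T) with T = 4
print(trajectory_log_prob(states=[0, 2, 1, 0], actions=[1, 0, 0, 1]))
```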

Our goal is to find the optimal parameters $\theta$ (denoted $\theta^\star$) that maximize the expected total reward:

$$\theta^\star=\arg\max_{\theta}\; E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$
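
This expectation can be estimated by Monte Carlo: roll out trajectories under $\pi_\theta$ and average the total reward. A minimal sketch, assuming the same kind of made-up tabular MDP plus a reward table `R`:

```python
import numpy as np

# Hypothetical tabular MDP with a reward table (all names are assumptions).
S, A, T = 3, 2, 5
rng = np.random.default_rng(0)
p0 = np.full(S, 1.0 / S)                    # p(s_1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)
R = rng.normal(size=(S, A))                 # r(s, a)

def estimate_objective(num_trajectories=10_000):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta(tau)}[ sum_t r(s_t, a_t) ]."""
    total = 0.0
    for _ in range(num_trajectories):
        s = rng.choice(S, p=p0)             # s_1 ~ p(s_1)
        for _ in range(T):
            a = rng.choice(A, p=policy[s])  # a_t ~ pi_theta(a_t | s_t)
            total += R[s, a]
            s = rng.choice(S, p=P[s, a])    # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return total / num_trajectories

print(estimate_objective())                 # approximates the expected total reward
```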

Two cases

The state-action pairs $(s_t,a_t)$ themselves form a Markov chain, with transition probability

$$p(s_{t+1},a_{t+1}|s_t,a_t)=p(s_{t+1}|s_t,a_t)\,\pi_\theta(a_{t+1}|s_{t+1})$$
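
In the tabular case this chain has an explicit $SA \times SA$ transition matrix; the sketch below (names such as `T_sa` are illustrative) builds it and checks that each row is a distribution over the next state-action pair:

```python
import numpy as np

# Hypothetical tabular setup (names and sizes are assumptions for illustration).
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)

# T_sa[(s, a), (s', a')] = p(s' | s, a) * pi_theta(a' | s')
T_sa = np.einsum('sap,pb->sapb', P, policy).reshape(S * A, S * A)

# Each row sums to 1, so (s_t, a_t) is a Markov chain with this transition matrix.
assert np.allclose(T_sa.sum(axis=1), 1.0)
```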

Finite horizon case: state-action marginal

$$\theta^\star=\arg\max_\theta \sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}\left[r(s_t,a_t)\right]$$

where $p_\theta(s_t,a_t)$ is called the state-action marginal.
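
In the tabular case the state-action marginals, and therefore the finite-horizon objective, can be computed exactly by propagating the state distribution forward one step at a time. A sketch with made-up arrays:

```python
import numpy as np

# Hypothetical tabular MDP (names and sizes are assumptions for illustration).
S, A, T = 3, 2, 5
rng = np.random.default_rng(0)
p0 = np.full(S, 1.0 / S)                    # p(s_1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)
R = rng.normal(size=(S, A))                 # r(s, a)

J = 0.0
p_s = p0.copy()                             # p_theta(s_t), starting from t = 1
for t in range(T):
    p_sa = p_s[:, None] * policy            # state-action marginal p_theta(s_t, a_t)
    J += np.sum(p_sa * R)                   # E_{(s_t, a_t) ~ p_theta(s_t, a_t)}[ r(s_t, a_t) ]
    p_s = np.einsum('sa,sap->p', p_sa, P)   # next state marginal p_theta(s_{t+1})

print(J)                                    # exact finite-horizon objective for this MDP
```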

Infinite horizon case: stationary distribution

$$\theta^\star =\arg\max_{\theta}\frac{1}{T}\sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \;\to\; E_{(s,a)\sim p_\theta(s,a)}[r(s,a)] \quad (T\to\infty)$$

where $\mu=p_\theta(s,a)$ is called the stationary distribution.

Thus $\mu =\mathcal{T}\mu$, where $\mathcal{T}$ is the state-action transition operator; "stationary" means the distribution is the same before and after a transition.

In other words, $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1, and such a $\mu$ always exists under some regularity conditions.
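
A sketch of this in the tabular case: build the state-action transition matrix and apply it repeatedly (power iteration) until the distribution stops changing; the result is the eigenvector with eigenvalue 1. The setup and names below are illustrative:

```python
import numpy as np

# Hypothetical tabular setup (names and sizes are assumptions for illustration).
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # p(s' | s, a)
policy = rng.dirichlet(np.ones(A), size=S)  # pi_theta(a | s)

# State-action transition operator: T_sa[(s, a), (s', a')] = p(s' | s, a) pi_theta(a' | s')
T_sa = np.einsum('sap,pb->sapb', P, policy).reshape(S * A, S * A)

# With a row-stochastic matrix, mu = T mu from the notes reads mu = T_sa^T mu.
# Power iteration: keep applying the transition until the distribution stops changing.
mu = np.full(S * A, 1.0 / (S * A))
for _ in range(1_000):
    mu = T_sa.T @ mu

assert np.allclose(mu, T_sa.T @ mu)         # mu is unchanged by a transition (eigenvalue 1)
print(mu.reshape(S, A))                     # stationary distribution p_theta(s, a)
```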
