# RL Objective

## The goal of reinforcement learning

![what is RL](/files/-Liqxqp0pYDIodw28NTB)

Trajectory probability: under the policy $$\pi_\theta$$, a trajectory has probability

$$
p_\theta(s_1,a_1,\cdots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)
$$

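We usually denote the sequence $$s_1,a_1,\cdots,s_T,a_T$$ by $$\tau$$, so the left-hand side is the trajectory probability $$p_\theta(\tau)$$. The right-hand side factorizes it as a Markov chain: the initial-state distribution $$p(s_1)$$ times, at every step, the policy $$\pi_\theta(a_t|s_t)$$ and the transition probability $$p(s_{t+1}|s_t,a_t)$$.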
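
The factorization can be checked numerically on a toy problem. The sketch below assumes a hypothetical tabular MDP with two states and two actions; the tables `P0`, `PI`, `P` and the function `trajectory_prob` are illustrative names and numbers, not part of the course material.

```python
# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
P0 = [0.6, 0.4]                    # P0[s] = p(s_1 = s)
PI = [[0.5, 0.5], [0.2, 0.8]]      # PI[s][a] = pi_theta(a | s)
P = [[[0.9, 0.1], [0.2, 0.8]],     # P[s][a][s2] = p(s2 | s, a)
     [[0.7, 0.3], [0.1, 0.9]]]


def trajectory_prob(states, actions):
    """Return p_theta(tau) for states [s_1..s_{T+1}] and actions [a_1..a_T]."""
    prob = P0[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= PI[s][a] * P[s][a][s_next]   # policy term times dynamics term
    return prob


# tau = (s_1=0, a_1=1, s_2=1, a_2=1), followed by s_3=1
print(trajectory_prob([0, 1, 1], [1, 1]))
```

Because the product runs to $$t=T$$, it includes the final transition into $$s_{T+1}$$, which is why `states` has one more entry than `actions`.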

The goal is to find the parameters $$\theta^\star$$ that maximize the expected total reward over trajectories:

$$
\theta^\star=\arg\max_{\theta} E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]
$$
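
For a fixed $$\theta$$, the expectation inside the $$\arg\max$$ can be approximated by sampling trajectories and averaging their total reward. The following is a minimal Monte Carlo sketch on a hypothetical two-state, two-action MDP; the tables, the reward values in `R`, and the function names are assumptions made for illustration.

```python
import random

# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
P0 = [0.6, 0.4]                    # P0[s] = p(s_1 = s)
PI = [[0.5, 0.5], [0.2, 0.8]]      # PI[s][a] = pi_theta(a | s)
P = [[[0.9, 0.1], [0.2, 0.8]],     # P[s][a][s2] = p(s2 | s, a)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0]]       # R[s][a] = r(s, a)


def sample_return(T, rng):
    """Sample one trajectory of T steps and return its total reward."""
    s = rng.choices([0, 1], weights=P0)[0]
    total = 0.0
    for _ in range(T):
        a = rng.choices([0, 1], weights=PI[s])[0]    # a_t ~ pi_theta(. | s_t)
        total += R[s][a]
        s = rng.choices([0, 1], weights=P[s][a])[0]  # s_{t+1} ~ p(. | s_t, a_t)
    return total


def estimate_objective(T, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{tau ~ p_theta}[sum_t r(s_t, a_t)]."""
    rng = random.Random(seed)
    return sum(sample_return(T, rng) for _ in range(n_samples)) / n_samples


print(estimate_objective(T=5))
```

The estimate is noisy; its error shrinks roughly as one over the square root of `n_samples`.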

## Two cases

![p(s,a)](/files/-Liqxqp2SzOAy350GFVl)

The pair $$(s_t,a_t)$$ itself forms a Markov chain, with transition probability

$$
p(s_{t+1},a_{t+1}|s_t,a_t)=p(s_{t+1}|s_t,a_t)\pi_\theta(a_{t+1}|s_{t+1})
$$
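
For a tabular problem, this transition can be written out as a matrix indexed by state-action pairs. The sketch below assumes a hypothetical two-state, two-action MDP; `PAIRS` and `state_action_transition` are illustrative names.

```python
# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
PI = [[0.5, 0.5], [0.2, 0.8]]      # PI[s][a] = pi_theta(a | s)
P = [[[0.9, 0.1], [0.2, 0.8]],     # P[s][a][s2] = p(s2 | s, a)
     [[0.7, 0.3], [0.1, 0.9]]]
PAIRS = [(s, a) for s in range(2) for a in range(2)]  # index order of (s, a)


def state_action_transition():
    """Return M with M[i][j] = p(s2 | s, a) * pi_theta(a2 | s2),
    where PAIRS[i] = (s, a) and PAIRS[j] = (s2, a2)."""
    return [[P[s][a][s2] * PI[s2][a2] for (s2, a2) in PAIRS]
            for (s, a) in PAIRS]


M = state_action_transition()
for row in M:
    print(row, sum(row))   # each row is a distribution over (s2, a2)
```

Each row sums to 1 because it is a probability distribution over the next state-action pair.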

### Finite horizon case: state-action marginal

$$
\theta^\star=\arg\max_\theta \sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]
$$

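where $$p_\theta(s_t,a_t)$$ is called the state-action marginal: the distribution of $$(s_t,a_t)$$ at time step $$t$$, obtained by marginalizing the trajectory distribution over all other time steps. This form follows from the trajectory objective by linearity of expectation.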
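
In a tabular setting the marginals can be computed exactly by pushing the state distribution forward one step at a time. The sketch below assumes a hypothetical two-state, two-action MDP; the tables and the functions `marginals` and `objective` are illustrative.

```python
# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
P0 = [0.6, 0.4]                    # P0[s] = p(s_1 = s)
PI = [[0.5, 0.5], [0.2, 0.8]]      # PI[s][a] = pi_theta(a | s)
P = [[[0.9, 0.1], [0.2, 0.8]],     # P[s][a][s2] = p(s2 | s, a)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0]]       # R[s][a] = r(s, a)


def marginals(T):
    """Return [p_theta(s_t, a_t) for t = 1..T], each a dict keyed by (s, a)."""
    state_dist = P0[:]
    out = []
    for _ in range(T):
        # p(s_t, a_t) = p(s_t) * pi_theta(a_t | s_t)
        joint = {(s, a): state_dist[s] * PI[s][a]
                 for s in range(2) for a in range(2)}
        out.append(joint)
        # p(s_{t+1}) = sum over (s_t, a_t) of p(s_t, a_t) * p(s_{t+1} | s_t, a_t)
        state_dist = [sum(joint[s, a] * P[s][a][s2] for (s, a) in joint)
                      for s2 in range(2)]
    return out


def objective(T):
    """Sum over t of E_{(s_t, a_t) ~ p_theta(s_t, a_t)}[r(s_t, a_t)]."""
    return sum(m[s, a] * R[s][a] for m in marginals(T) for (s, a) in m)


print(objective(5))
```

This gives the same value as averaging total reward over whole trajectories, but it only ever needs the per-step distributions.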

### Infinite horizon case: stationary distribution

$$
\theta^\star =\arg\max_{\theta}\frac{1}{T}\sum_{t=1}^T E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \to E_{(s,a)\sim p_\theta(s,a)}[r(s,a)] \quad \text{as } T\to\infty
$$

where $$\mu=p_\theta(s,a)$$ is called the stationary distribution.

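It satisfies $$\mu =\mathcal{T}\mu$$, where $$\mathcal{T}$$ is the state-action transition operator given by $$p(s_{t+1},a_{t+1}|s_t,a_t)$$. In other words, **stationary = the same before and after transition**.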

Equivalently, $$\mu$$ is an eigenvector of $$\mathcal{T}$$ with eigenvalue 1. Such a distribution always exists under some regularity conditions.
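
One simple way to find such a fixed point numerically is power iteration: start from any distribution and apply the transition repeatedly until it stops changing. The sketch below assumes a hypothetical two-state, two-action MDP in which every transition has positive probability, so the iteration converges; the tables and the function `stationary_distribution` are illustrative, and power iteration is just one possible method.

```python
# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
PI = [[0.5, 0.5], [0.2, 0.8]]      # PI[s][a] = pi_theta(a | s)
P = [[[0.9, 0.1], [0.2, 0.8]],     # P[s][a][s2] = p(s2 | s, a)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0]]       # R[s][a] = r(s, a)
PAIRS = [(s, a) for s in range(2) for a in range(2)]


def stationary_distribution(iters=200):
    """Power iteration on the state-action chain; returns a dict keyed by (s, a)."""
    n = len(PAIRS)
    # M[i][j] = p(s2, a2 | s, a). Rows sum to 1, so with mu stored as a row
    # vector the update mu <- mu M plays the role of mu <- T mu in the text.
    M = [[P[s][a][s2] * PI[s2][a2] for (s2, a2) in PAIRS] for (s, a) in PAIRS]
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * M[i][j] for i in range(n)) for j in range(n)]
    return dict(zip(PAIRS, mu))


mu = stationary_distribution()
avg_reward = sum(mu[s, a] * R[s][a] for (s, a) in PAIRS)  # E_{(s,a) ~ mu}[r(s, a)]
print(mu)
print(avg_reward)
```

The last value is the infinite-horizon objective for this fixed policy: the expected reward under the stationary distribution.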
