In Monte Carlo policy gradients, there are two options for where to apply the discount:
option 1 (discount the "reward to go", relative to the current step t):
∇θ J(θ) ≈ (1/N) Σ_i Σ_{t=1..T} ∇θ log πθ(a_{i,t}∣s_{i,t}) · (Σ_{t'=t..T} γ^(t'−t) r(s_{i,t'}, a_{i,t'}))
option 2 (discount the whole trajectory, from t = 1):
∇θ J(θ) ≈ (1/N) Σ_i (Σ_{t=1..T} ∇θ log πθ(a_{i,t}∣s_{i,t})) · (Σ_{t=1..T} γ^(t−1) r(s_{i,t}, a_{i,t}))
Consider causality (an action at step t cannot affect rewards from earlier steps), so option 2 becomes:
∇θ J(θ) ≈ (1/N) Σ_i Σ_{t=1..T} γ^(t−1) ∇θ log πθ(a_{i,t}∣s_{i,t}) · (Σ_{t'=t..T} γ^(t'−t) r(s_{i,t'}, a_{i,t'}))
The reason we use a discount factor in the first place is to fix the infinite-sum problem in continuing tasks, whereas the "death" model (option 2) mostly cares about the early steps of the episode. What we actually want to approximate is the average reward without discount; the discount instead acts as a variance reducer: future rewards are more uncertain, so their weight is removed gradually.
So far we have only discussed episodic tasks, but what about continuing/cyclical tasks? What if T is ∞? In many cases V^ϕπ can become infinitely large. A simple trick solves this: make it better to get rewards sooner than later. Introduce a discount factor γ ∈ (0, 1), so the bootstrapped target becomes y_{i,t} ≈ r(s_{i,t}, a_{i,t}) + γ V^ϕπ(s_{i,t+1}).
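A small numeric sketch of why this trick works (all names here are illustrative, not from the notes): with γ < 1, an infinite stream of bounded rewards sums to at most r_max / (1 − γ) by the geometric series, and value fitting uses the bootstrapped target y = r + γ·V(s'):

```python
# Sketch: the discount factor keeps infinite-horizon values finite,
# and yields a bootstrapped fitting target. Names are hypothetical.

gamma = 0.99
r_max = 1.0  # assume |r| <= r_max at every step

# Geometric-series bound: sum_{t>=0} gamma^t * r_max = r_max / (1 - gamma)
value_bound = r_max / (1 - gamma)  # ~100 for gamma = 0.99

def bootstrap_target(reward, v_next, gamma=0.99):
    """Target for fitting the value function: y = r + gamma * V(s')."""
    return reward + gamma * v_next
```

So even with T = ∞, the target the critic regresses onto stays bounded.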
Option 1 only changes the "reward to go": rewards are discounted relative to the current step. Option 2 discounts the whole episode from t = 1, so after applying causality an extra γ^(t−1) multiplies the gradient at every step t. Option 2 is the true objective when the robot has probability 1 − γ of dying at every step, but option 1 is what we actually use.
The batch actor-critic algorithm then begins:
====1: sample {s_i, a_i} from πθ(a∣s) (run it on the robot)
====2: fit V^ϕπ(s) to the sampled reward sums (use the bootstrapped estimate as the target)
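The two steps above might be sketched as follows, on a synthetic batch with a hypothetical linear value function V(s) = s·ϕ fitted by least squares (the real algorithm would use a neural network and continue on to advantage estimation and a policy gradient step):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (sketch): pretend we ran pi_theta on the robot and collected a
# batch of transitions -- everything here is synthetic/illustrative.
states = rng.normal(size=(64, 4))
rewards = rng.normal(size=64)
next_states = rng.normal(size=(64, 4))

# Step 2 (sketch): fit V(s) = s @ phi to the bootstrapped targets
# y = r + gamma * V(s'), re-solving the least-squares problem each sweep.
gamma = 0.99
phi = np.zeros(4)
for _ in range(5):  # a few fitted-value sweeps
    targets = rewards + gamma * (next_states @ phi)  # bootstrapped targets
    phi, *_ = np.linalg.lstsq(states, targets, rcond=None)
```

Each sweep recomputes the targets with the current ϕ and refits, which is the "bootstrapped estimate target" the notes refer to.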