Discount factors

Infinite-horizon case

So far we have only discussed episodic tasks, but what about continuing/cyclical tasks, where $T$ is $\infty$? In many cases $\hat{V}^\pi_\phi$ can grow infinitely large. A simple trick solves this problem: make it better to get rewards sooner rather than later.
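To see why this keeps values finite: if future rewards are weighted by powers of a factor $\gamma<1$ and per-step rewards are bounded by some $r_{\max}$, the discounted sum is bounded by a geometric series:

$$\sum_{t=1}^{\infty}\gamma^{t-1}\, r(s_t,a_t)\;\le\;\sum_{t=1}^{\infty}\gamma^{t-1}\, r_{\max}\;=\;\frac{r_{\max}}{1-\gamma}$$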

One interpretation: at every step the agent has probability $1-\gamma$ of dying, i.e., of transitioning into an absorbing state with zero reward:

(figure: the "new" MDP with an added absorbing death state)
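Concretely, the modified MDP keeps the original dynamics with probability $\gamma$ and otherwise jumps to the zero-reward death state:

$$\tilde{p}(s'|s,a)=\gamma\, p(s'|s,a),\qquad \tilde{p}(\text{death}|s,a)=1-\gamma$$

so expected (undiscounted) returns in the new MDP equal $\gamma$-discounted returns in the original one.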

So the new target is:

$$y_{i,t}\approx r(s_{i,t},a_{i,t})+\gamma \hat{V}^\pi_\phi(s_{i,t+1})$$

with discount factor $\gamma \in [0,1]$ (0.99 works well in practice).
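As a concrete illustration, here is a minimal numpy sketch of computing these bootstrapped targets for one sampled trajectory; `value_fn` is a hypothetical stand-in for the fitted critic $\hat{V}^\pi_\phi$ (not something defined in these notes).

```python
import numpy as np

def bootstrapped_targets(rewards, next_states, value_fn, gamma=0.99):
    """Compute y_t = r_t + gamma * V(s_{t+1}) for one trajectory.

    rewards:     array of shape (T,)
    next_states: array of shape (T, state_dim)
    value_fn:    hypothetical callable mapping a batch of states to value estimates (T,)
    """
    next_values = value_fn(next_states)      # hat{V}^pi_phi(s_{t+1})
    return rewards + gamma * next_values     # discounted one-step targets
```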

$$\begin{aligned} \nabla_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t})+\gamma \hat{V}^\pi_\phi(s_{i,t+1})-\hat{V}^\pi_\phi(s_{i,t}) \right) \\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, \hat{A}^\pi_\phi(s_{i,t},a_{i,t}) \end{aligned}$$
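In an autodiff framework this estimator is commonly implemented as a surrogate loss whose gradient matches the expression above; the PyTorch-style sketch below is one possible way to do that (not taken from these notes), assuming `log_probs` and `advantages` are already computed.

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """Surrogate loss whose gradient is the advantage-weighted policy gradient.

    log_probs:  tensor of log pi_theta(a_t|s_t), shape (N*T,), requires grad
    advantages: tensor of hat{A}(s_t, a_t), shape (N*T,), treated as a constant
    """
    # Detach the advantages so gradients only flow through log pi_theta;
    # minimizing this loss performs gradient ascent on J(theta).
    return -(log_probs * advantages.detach()).mean()
```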

Discount factors for policy gradient

In Monte Carlo policy gradients, there are two options for where to apply the discount:

option 1:

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t}\, r(s_{i,t'},a_{i,t'})\right)$$
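A minimal numpy sketch of option 1's discounted "reward to go" for a single trajectory, computed with a reverse accumulation pass (this helper is illustrative, not part of the notes):

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """Option 1: sum_{t'=t}^{T} gamma^{t'-t} * r_{t'} for every step t."""
    rtg = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # r_t + gamma * (reward to go from t+1)
        rtg[t] = running
    return rtg
```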

option 2:

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T \gamma^{t-1}\, r(s_{i,t},a_{i,t})\right)$$

Applying causality to option 2:

$$\begin{aligned} \nabla_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-1}\, r(s_{i,t'},a_{i,t'})\right)\\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\gamma^{t-1}\,\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t}\, r(s_{i,t'},a_{i,t'})\right) \end{aligned}$$

Option 1 only discounts the "reward to go": rewards are discounted relative to the current time step $t$. Option 2 discounts from the start of the episode, which (after applying causality) also downweights the gradient term at step $t$ by an extra factor $\gamma^{t-1}$. Option 2 is the mathematically correct gradient for the "death" MDP above, where the robot may die with probability $1-\gamma$ at every step. Nevertheless, option 1 is what we actually use.

The reason is that we introduced the discount factor to keep returns finite in the infinite-horizon case, not because we truly believe in the death model: option 2 mostly cares about the early steps of the episode and largely ignores the later ones. What we really want to approximate is the undiscounted average reward; the discount acts more like a variance-reduction device, gradually downweighting far-future rewards because they are more uncertain.
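To make the difference concrete, here is a small self-contained sketch (not from the notes) of the scalar that each option attaches to $\nabla_\theta \log \pi_\theta(a_t|s_t)$ for one trajectory:

```python
import numpy as np

def option_weights(rewards, gamma=0.99):
    """Per-step scalars multiplying grad log pi(a_t|s_t) under options 1 and 2."""
    T = len(rewards)
    rtg = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                 # discounted reward to go
        running = rewards[t] + gamma * running
        rtg[t] = running
    steps = np.arange(T)                         # t = 0, 1, ..., T-1 (0-indexed)
    option1 = rtg                                # sum_{t'>=t} gamma^{t'-t} r_{t'}
    option2 = gamma**steps * rtg                 # extra gamma^{t-1} factor from the episode start
    return option1, option2
```

For constant rewards, `option2` decays geometrically toward zero with $t$, which is exactly why the death model largely ignores late time steps.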

Actor-critic algorithms (with discount)

batch version

batch actor-critic algorithm (a code sketch follows the steps):

repeat until convergence:

====1: sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)

====2: fit $\hat{V}^\pi_\phi(s)$ using the bootstrapped targets $y_i = r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')$

====3: evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$

====4: $\nabla_\theta J(\theta)\approx \sum_i \nabla_\theta \log \pi_\theta (a_i|s_i)\,\hat{A}^\pi(s_i,a_i)$

====5: $\theta \leftarrow \theta+ \alpha\nabla_\theta J(\theta)$
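A minimal PyTorch sketch of steps 2-5 for one batch iteration. Everything here is an assumption for illustration, not part of the notes: the sampled data is already flattened into tensors, `policy.log_prob(s, a)` returns $\log\pi_\theta(a|s)$, `critic(s)` returns $\hat{V}^\pi_\phi(s)$ with shape `(N, 1)`, and `dones` marks terminal transitions so we do not bootstrap past the end of an episode.

```python
import torch

def batch_actor_critic_step(policy, critic, policy_opt, critic_opt,
                            states, actions, rewards, next_states, dones,
                            gamma=0.99):
    """One iteration of the batch actor-critic update (steps 2-5 above)."""
    # Step 2: fit the critic to the bootstrapped targets r + gamma * V(s').
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * critic(next_states).squeeze(-1)
    values = critic(states).squeeze(-1)
    critic_loss = torch.nn.functional.mse_loss(values, targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 3: advantage estimates A = r + gamma * V(s') - V(s), using the updated critic.
    with torch.no_grad():
        advantages = targets - critic(states).squeeze(-1)

    # Steps 4-5: policy gradient ascent via a surrogate loss.
    log_probs = policy.log_prob(states, actions)   # hypothetical interface
    policy_loss = -(log_probs * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```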

online version

online actor-critic algorithm (a code sketch follows the steps):

repeat until convergence:

====1: take action $a\sim \pi_\theta(a|s)$, get $(s,a,s',r)$

====2: update $\hat{V}^\pi_\phi$ using target $r+\gamma\hat{V}^\pi_\phi(s')$

====3: evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}^\pi_\phi(s')-\hat{V}^\pi_\phi(s)$

====4: $\nabla_\theta J(\theta)\approx \nabla_\theta \log \pi_\theta (a|s)\,\hat{A}^\pi(s,a)$

====5: $\theta \leftarrow \theta+ \alpha\nabla_\theta J(\theta)$
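A matching sketch of a single online update from one transition $(s,a,s',r)$, with the same hypothetical `policy`/`critic` interfaces as in the batch sketch (inputs are assumed to be batched tensors of size 1):

```python
import torch

def online_actor_critic_step(policy, critic, policy_opt, critic_opt,
                             s, a, s_next, r, done, gamma=0.99):
    """One online actor-critic update from a single transition (steps 2-5 above)."""
    # Step 2: move V(s) toward the bootstrapped target r + gamma * V(s').
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next).squeeze(-1)
    critic_loss = (critic(s).squeeze(-1) - target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 3: single-sample advantage estimate.
    with torch.no_grad():
        advantage = target - critic(s).squeeze(-1)

    # Steps 4-5: one gradient ascent step on the policy.
    log_prob = policy.log_prob(s, a)               # hypothetical interface
    policy_loss = -(log_prob * advantage).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```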
