Discount factors
So far we have only discussed episodic tasks, but what about continuing/cyclical tasks? What if $T = \infty$? In many cases the sum of rewards (and hence $\hat{V}^\pi_\phi$) can get infinitely large. A simple trick solves this problem: make it better to get rewards sooner rather than later.
One way to interpret this trick: we have a chance to die at every step. With probability $1-\gamma$ the episode ends in an absorbing "death" state with reward zero, and with probability $\gamma$ it continues as usual:
$$\tilde{p}(s_{t+1} \mid s_t, a_t) = \gamma\, p(s_{t+1} \mid s_t, a_t), \qquad \tilde{p}(\text{death} \mid s_t, a_t) = 1 - \gamma$$
So the new target for the critic is:
$$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma\, \hat{V}^\pi_\phi(s_{i,t+1})$$
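As a quick sanity check (a standard geometric-series bound, not spelled out in the original notes): if rewards are bounded, $|r(s_t, a_t)| \le r_{\max}$, and $\gamma < 1$, the discounted return stays finite even when $T = \infty$:
$$\left| \sum_{t=1}^{\infty} \gamma^{t-1}\, r(s_t, a_t) \right| \;\le\; \sum_{t=1}^{\infty} \gamma^{t-1}\, r_{\max} \;=\; \frac{r_{\max}}{1-\gamma},$$
which is exactly the infinity problem the trick is meant to solve.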
In Monte Carlo policy gradients, there are two options for where to apply the discount:
option 1 (discount the rewards to go):
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'}, a_{i,t'}) \right)$$
option 2 (discount the whole trajectory):
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} \gamma^{t-1}\, r(s_{i,t}, a_{i,t}) \right)$$
Considering causality (the action at time $t$ cannot affect rewards at earlier times), option 2 becomes:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1}\, \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'}, a_{i,t'}) \right)$$
Note the extra $\gamma^{t-1}$ in front of the gradient term: under option 2, later time steps are down-weighted as a whole, not just their rewards.
Which one is right? Option 1 only changes the "reward to go": the discount is applied from the current state onward. Option 2 discounts the whole episode, so it effectively only cares about the early steps. Option 2 is the exact gradient under the death model, i.e., when the robot really has a chance of dying at every step, but option 1 is what we choose. The reason is that we only introduce the discount factor to avoid infinite sums in the continuing case; what we actually want to approximate is the average reward without a discount. Future rewards are simply more uncertain, so their contribution should be reduced gradually.
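To make the difference concrete, here is a small NumPy sketch (illustrative only; the constant-reward toy episode is made up) that computes the per-step weights multiplying $\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})$ under option 1 and under option 2 with causality:

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma):
    """Option 1 weights: sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, for each t."""
    weights = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running
    return weights

def option2_weights(rewards, gamma):
    """Option 2 (after causality): gamma^(t-1) times the discounted reward to go."""
    return gamma ** np.arange(len(rewards)) * discounted_reward_to_go(rewards, gamma)

# Toy episode: constant reward of 1 at every one of 100 steps (made-up numbers).
rewards = np.ones(100)
w1 = discounted_reward_to_go(rewards, gamma=0.99)
w2 = option2_weights(rewards, gamma=0.99)
print(w1[0], w1[-1])  # ~63.4 and 1.0: weight depends only on rewards from t onward
print(w2[0], w2[-1])  # ~63.4 and ~0.37: late steps are additionally down-weighted
```

Option 2 scales the weight at step $t$ by an extra $\gamma^{t-1}$, so late time steps contribute less and less even though their reward to go is unchanged; option 1 keeps them fully weighted.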
batch version

batch actor-critic algorithm:
repeat until convergence:
1. sample $\{s_i, a_i\}$ from $\pi_\theta(a \mid s)$ (run it on the robot)
2. fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (use the bootstrapped estimate $r(s_i, a_i) + \gamma\, \hat{V}^\pi_\phi(s'_i)$ as the target)
3. evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma\, \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$
4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{A}^\pi(s_i, a_i)$
5. $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
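As a concrete illustration (not from the original notes), here is a minimal PyTorch sketch of one such iteration for a toy discrete-action problem; the network sizes, `obs_dim`, `n_actions`, and the random "transitions" standing in for step 1 are all made up:

```python
import torch
import torch.nn as nn

# Hypothetical tiny networks for a toy problem (sizes are made up).
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-3
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=lr)
value_opt = torch.optim.Adam(value_fn.parameters(), lr=lr)

def batch_actor_critic_step(s, a, r, s_next, done):
    """One iteration of batch actor-critic on sampled transitions (s, a, r, s')."""
    # step 2: fit V_phi(s) to the bootstrapped target y = r + gamma * V_phi(s')
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * value_fn(s_next).squeeze(-1)
    value_loss = ((value_fn(s).squeeze(-1) - target) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # step 3: evaluate the advantage A(s, a) = r + gamma * V(s') - V(s)
    with torch.no_grad():
        adv = r + gamma * (1.0 - done) * value_fn(s_next).squeeze(-1) - value_fn(s).squeeze(-1)

    # steps 4-5: grad J approx sum_i grad log pi(a_i | s_i) * A_i, then gradient ascent
    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    policy_loss = -(log_prob * adv).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

# step 1 stand-in: made-up random transitions instead of "run it on the robot".
N = 64
s, s_next = torch.randn(N, obs_dim), torch.randn(N, obs_dim)
a = torch.randint(0, n_actions, (N,))
r, done = torch.randn(N), torch.zeros(N)
batch_actor_critic_step(s, a, r, s_next, done)
```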
online version

online actor-critic algorithm:
repeat until convergence:
1. take action $a \sim \pi_\theta(a \mid s)$, get $(s, a, s', r)$
2. update $\hat{V}^\pi_\phi$ using the target $r + \gamma\, \hat{V}^\pi_\phi(s')$
3. evaluate $\hat{A}^\pi(s, a) = r(s, a) + \gamma\, \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s)\, \hat{A}^\pi(s, a)$
5. $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
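And a corresponding sketch of the online version, again a made-up toy setup (it reuses the same kind of hypothetical networks as the batch sketch); each call performs one update from a single transition:

```python
import torch
import torch.nn as nn

# Same hypothetical networks as in the batch sketch above.
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-3
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=lr)
value_opt = torch.optim.Adam(value_fn.parameters(), lr=lr)

def online_actor_critic_step(s, a, r, s_next, done):
    """One online update from a single transition (s, a, s', r)."""
    # step 2: move V_phi(s) toward the one-step target r + gamma * V_phi(s')
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * value_fn(s_next).squeeze(-1)
    value_loss = (value_fn(s).squeeze(-1) - target) ** 2
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # step 3: advantage of the action that was just taken
    with torch.no_grad():
        adv = r + gamma * (1.0 - done) * value_fn(s_next).squeeze(-1) - value_fn(s).squeeze(-1)

    # steps 4-5: single-sample policy gradient step
    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    policy_loss = -(log_prob * adv)
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

# step 1: take an action; the environment step here is made up (random next state / reward).
s = torch.randn(obs_dim)
a = torch.distributions.Categorical(logits=policy(s)).sample()
s_next, r, done = torch.randn(obs_dim), torch.randn(()), torch.tensor(0.0)
online_actor_critic_step(s, a, r, s_next, done)
```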
In both versions, the discount factor $\gamma \in [0, 1]$ ($\gamma = 0.99$ works well in practice).