In Monte Carlo policy gradients, there are two options for where to apply the discount:
option 1 (discount the "reward to go", relative to the current step t):
∇θ J(θ) ≈ (1/N) Σ_i Σ_{t=1..T} ∇θ log πθ(a_{i,t}∣s_{i,t}) · (Σ_{t'=t..T} γ^(t'−t) r(s_{i,t'}, a_{i,t'}))
option 2 (discount the whole trajectory, from t = 1):
∇θ J(θ) ≈ (1/N) Σ_i (Σ_{t=1..T} ∇θ log πθ(a_{i,t}∣s_{i,t})) · (Σ_{t=1..T} γ^(t−1) r(s_{i,t}, a_{i,t}))
Consider causality (an action at step t cannot affect rewards from earlier steps), so option 2 becomes:
∇θ J(θ) ≈ (1/N) Σ_i Σ_{t=1..T} γ^(t−1) ∇θ log πθ(a_{i,t}∣s_{i,t}) · (Σ_{t'=t..T} γ^(t'−t) r(s_{i,t'}, a_{i,t'}))
The reason we use a discount factor in the first place is to fix the infinite-sum problem in continuing tasks, whereas the "death" model (option 2) mostly cares about the early steps of the episode. What we actually want to approximate is the average reward without discount; the discount instead acts as a variance reducer: future rewards are more uncertain, so their weight is removed gradually.
So far we have only discussed episodic tasks, but what about continuing/cyclical tasks? What if T is ∞? In many cases V^ϕπ can become infinitely large. A simple trick solves this: make it better to get rewards sooner than later. Introduce a discount factor γ ∈ (0, 1), so the bootstrapped target becomes y_{i,t} ≈ r(s_{i,t}, a_{i,t}) + γ V^ϕπ(s_{i,t+1}).
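A small numeric sketch of why this trick works (all names here are illustrative, not from the notes): with γ < 1, an infinite stream of bounded rewards sums to at most r_max / (1 − γ) by the geometric series, and value fitting uses the bootstrapped target y = r + γ·V(s'):

```python
# Sketch: the discount factor keeps infinite-horizon values finite,
# and yields a bootstrapped fitting target. Names are hypothetical.

gamma = 0.99
r_max = 1.0  # assume |r| <= r_max at every step

# Geometric-series bound: sum_{t>=0} gamma^t * r_max = r_max / (1 - gamma)
value_bound = r_max / (1 - gamma)  # ~100 for gamma = 0.99

def bootstrap_target(reward, v_next, gamma=0.99):
    """Target for fitting the value function: y = r + gamma * V(s')."""
    return reward + gamma * v_next
```

So even with T = ∞, the target the critic regresses onto stays bounded.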
Option 1 only changes the "reward to go": rewards are discounted relative to the current step. Option 2 discounts the whole episode from t = 1, so after applying causality an extra γ^(t−1) multiplies the gradient at every step t. Option 2 is the true objective when the robot has probability 1 − γ of dying at every step, but option 1 is what we actually use.
The batch actor-critic algorithm then begins:
====1: sample {s_i, a_i} from πθ(a∣s) (run it on the robot)
====2: fit V^ϕπ(s) to the sampled reward sums (use the bootstrapped estimate as the target)
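The two steps above might be sketched as follows, on a synthetic batch with a hypothetical linear value function V(s) = s·ϕ fitted by least squares (the real algorithm would use a neural network and continue on to advantage estimation and a policy gradient step):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (sketch): pretend we ran pi_theta on the robot and collected a
# batch of transitions -- everything here is synthetic/illustrative.
states = rng.normal(size=(64, 4))
rewards = rng.normal(size=64)
next_states = rng.normal(size=(64, 4))

# Step 2 (sketch): fit V(s) = s @ phi to the bootstrapped targets
# y = r + gamma * V(s'), re-solving the least-squares problem each sweep.
gamma = 0.99
phi = np.zeros(4)
for _ in range(5):  # a few fitted-value sweeps
    targets = rewards + gamma * (next_states @ phi)  # bootstrapped targets
    phi, *_ = np.linalg.lstsq(states, targets, rcond=None)
```

Each sweep recomputes the targets with the current ϕ and refits, which is the "bootstrapped estimate target" the notes refer to.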