Policy evaluation
Last updated
So now we need to fit one of $Q^\pi$, $V^\pi$, and $A^\pi$. The question is: fit what to what? We knew:

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t)}\left[V^\pi(s_{t+1})\right]$$

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Firstly, both $Q^\pi$ and $A^\pi$ can be calculated or approximated by $V^\pi$:

$$Q^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1})$$

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$

Apart from this, both $Q^\pi$ and $A^\pi$ need two inputs $(s_t, a_t)$, but $V^\pi$ only needs $s_t$, which may be easier to fit.

So let's just fit $V^\pi(s)$.

Fitting $V^\pi(s)$ evaluates how good the policy is, which is called policy evaluation. As in policy gradient, we can use Monte Carlo policy evaluation:

$$V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$

If we are able to reset the simulator, we can take more than one sample:

$$V^\pi(s_t) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$

But the former single-sample estimate is still pretty good, so the supervised learning setup is:

training data: $\left\{ \left( s_{i,t}, y_{i,t} \right) \right\}$. Define target $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.

supervised regression: $\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\| \hat{V}^\pi_\phi(s_i) - y_i \right\|^2$

The Monte Carlo target is not perfect. Can we do better? The ideal target is:

$$y_{i,t} = r(s_{i,t}, a_{i,t}) + \mathbb{E}\left[V^\pi(s_{i,t+1})\right] \approx r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$$

Sample one step and then directly use the previously fitted value function. This estimate will have more bias but lower variance.
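The two choices of regression target, the Monte Carlo reward sum and the bootstrapped one-step target, can be sketched as follows. This is a minimal NumPy sketch; the function names and the discount factor `gamma` are illustrative additions (set `gamma=1.0` for the undiscounted sums used above):

```python
import numpy as np

def monte_carlo_targets(rewards, gamma=1.0):
    """y_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}.
    Unbiased, but high variance: it relies on a single sampled trajectory."""
    y = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        y[t] = running
    return y

def bootstrapped_targets(rewards, v_next, gamma=1.0):
    """y_t = r_t + gamma * V_phi(s_{t+1}).
    Plugs in the previously fitted value function at the next state,
    trading extra bias for lower variance."""
    return np.asarray(rewards, dtype=float) + gamma * np.asarray(v_next, dtype=float)
```

Either set of targets is then plugged into the same supervised regression loss $\mathcal{L}(\phi) = \frac{1}{2} \sum_i \| \hat{V}^\pi_\phi(s_i) - y_i \|^2$.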
1. sample $\{s_i, a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)
2. fit $\hat{V}^\pi_\phi(s)$ to sampled reward sums (or use the bootstrapped estimate target $y_i = r(s_i, a_i) + \hat{V}^\pi_\phi(s'_i)$)
3. evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$
4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i | s_i) \, \hat{A}^\pi(s_i, a_i)$
5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
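The loop above can be sketched end to end on a toy problem. Everything in this sketch beyond the five steps themselves is a hypothetical setup for illustration: a 1-D chain environment ($s' = s \pm 1$, reward $-|s'|$), a state-independent softmax policy, and a linear value function in place of the usual neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout(theta, T=20):
    """Step 1: sample a trajectory from pi_theta (here a toy 1-D chain)."""
    s, traj = 0.0, []
    for _ in range(T):
        p = softmax(theta)                 # state-independent policy, for brevity
        a = rng.choice(2, p=p)             # a=0 -> move -1, a=1 -> move +1
        s_next = s + (1.0 if a == 1 else -1.0)
        traj.append((s, a, -abs(s_next), s_next))
        s = s_next
    return traj

def fit_value(states, targets):
    """Step 2: supervised regression of V_phi onto the targets.
    V_phi is linear in [s, 1] here; a neural net is typical."""
    X = np.column_stack([states, np.ones(len(states))])
    phi, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return phi

def V(phi, s):
    return phi[0] * np.asarray(s) + phi[1]

def update(theta, alpha=0.01, n_traj=10):
    """One iteration of the batch actor-critic loop (steps 1-5)."""
    grad = np.zeros_like(theta)
    for traj in (rollout(theta) for _ in range(n_traj)):
        s, a, r, s2 = map(np.array, zip(*traj))
        y = np.cumsum(r[::-1])[::-1]         # Monte Carlo reward sums (undiscounted)
        phi = fit_value(s, y)                # step 2
        adv = r + V(phi, s2) - V(phi, s)     # step 3 (terminal state not special-cased)
        for a_i, adv_i in zip(a, adv):
            glogpi = -softmax(theta)         # grad of log softmax ...
            glogpi[a_i] += 1.0               # ... is onehot(a) - pi
            grad += glogpi * adv_i           # step 4
    return theta + alpha * grad / n_traj     # step 5
```

A real implementation would replace the linear regression with gradient steps on a value network and handle the terminal state in the bootstrap, but the shape of the loop is the same.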