Policy evaluation

Fit what?

So now we need to fit one of $Q^\pi$, $V^\pi$, or $A^\pi$. The question is: fit what to what? We know:

$$
\begin{aligned}
Q^\pi(s_t,a_t)&=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t] \\
V^\pi(s_t)&=\mathbb{E}_{a_t\sim \pi_\theta(a_t|s_t)}[Q^\pi(s_t,a_t)] \\
A^\pi(s_t,a_t)&=Q^\pi(s_t,a_t)-V^\pi(s_t)
\end{aligned}
$$

First, both $Q^\pi$ and $A^\pi$ can be computed or approximated from $V^\pi$:

$$
\begin{aligned}
Q^\pi(s_t,a_t) &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t] \\
&\approx r(s_t,a_t)+\sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t] \\
&\approx r(s_t,a_t)+V^\pi(s_{t+1}) \\
A^\pi(s_t,a_t)&\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)
\end{aligned}
$$

Apart from this, both $Q^\pi$ and $A^\pi$ take two inputs $s_t, a_t$, while $V^\pi$ only needs $s_t$, which may be easier to fit.

So let's just fit $V^\pi$.
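For concreteness, here is a minimal sketch of how approximate $Q$ and $A$ values can be recovered from any fitted $\hat{V}^\pi$ using the single-step approximations above (the function names are illustrative, not from the original):

```python
# A minimal sketch; `v_hat` stands for any fitted value-function approximator
# that maps a state to a scalar estimate of V^pi(s).

def q_estimate(v_hat, r_t, s_next):
    # Q^pi(s_t, a_t) ~ r(s_t, a_t) + V^pi(s_{t+1})
    return r_t + v_hat(s_next)

def advantage_estimate(v_hat, r_t, s_t, s_next):
    # A^pi(s_t, a_t) ~ r(s_t, a_t) + V^pi(s_{t+1}) - V^pi(s_t)
    return r_t + v_hat(s_next) - v_hat(s_t)
```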

Fitted to what?

$$
V^\pi(s_t)=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]
$$

Fitting $V^\pi(s_t)$ evaluates how good the policy is, which is called policy evaluation. As in policy gradient, we can use Monte Carlo policy evaluation:

$$
V^\pi(s_t)\approx\sum_{t'=t}^T r(s_{t'},a_{t'})
$$

If we are able to reset the simulator, we can use more than one sample:

$$
V^\pi(s_t)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{t'},a_{t'})
$$

But the single-sample estimate is still pretty good, so the supervised learning setup is:

training data: $\left\{\left(s_{i,t},\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\right\}$. Define the target $y_{i,t}=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$.

supervised regression: $\mathcal{L}(\phi)=\frac{1}{2}\sum_i \|\hat{V}_{\phi}^\pi(s_i)-y_i\|^2$
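A minimal sketch of this regression in PyTorch (the network architecture, state dimension, and hyperparameters are assumptions, not part of the original): compute Monte Carlo reward-to-go targets from a sampled trajectory, then fit $\hat{V}^\pi_\phi$ with the squared-error loss above.

```python
import torch
import torch.nn as nn

# Value function approximator V_hat_phi; a 4-dimensional state is assumed.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def monte_carlo_targets(rewards):
    """Reward-to-go sums y_t = sum_{t'=t}^T r(s_t', a_t') for one trajectory."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running += r
        targets.append(running)
    return list(reversed(targets))

def fit_value_function(states, targets, epochs=50):
    """Supervised regression of V_hat_phi(s_i) onto the targets y_i."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = 0.5 * ((value_net(states).squeeze(-1) - targets) ** 2).sum()
        loss.backward()
        optimizer.step()
```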

Bootstrapped estimate

The Monte Carlo target $y_{i,t}$ is not perfect. Can we do better? The ideal target is:

$$
\begin{aligned}
y_{i,t}&=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{i,t'},a_{i,t'})|s_{i,t}] \\
&\approx r(s_{i,t},a_{i,t})+\sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}[r(s_{i,t'},a_{i,t'})|s_{i,t}] \\
&\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1}) \\
&\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_{\phi}(s_{i,t+1})
\end{aligned}
$$

Sample one step, then directly use the previously fitted value function.

This estimate has higher bias but lower variance.
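A minimal sketch of the bootstrapped target (tensor names and the terminal-state handling are assumptions): each target reuses the current $\hat{V}^\pi_\phi$ for the tail of the return instead of the full Monte Carlo sum.

```python
import torch

def bootstrapped_targets(value_net, rewards, next_states, dones):
    """y_t ~ r(s_t, a_t) + V_hat_phi(s_{t+1}); `dones` zeroes out terminal next states."""
    with torch.no_grad():  # treat the bootstrap term as a fixed regression target
        next_values = value_net(next_states).squeeze(-1)
    return rewards + (1.0 - dones) * next_values
```

These targets can be plugged into `fit_value_function` from the previous sketch in place of the Monte Carlo sums.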

Algorithm

Batch actor-critic algorithm:

repeat until convergence:

1. sample $\{s_i, a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)
2. fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums (or use the bootstrapped estimate as the target)
3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$
4. $\nabla_\theta J(\theta)\approx \sum_i \nabla_\theta \log \pi_\theta (a_i|s_i)\hat{A}^\pi(s_i,a_i)$
5. $\theta \leftarrow \theta+ \alpha\nabla_\theta J(\theta)$
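Putting the five steps together, here is a minimal sketch of one batch actor-critic iteration, building on the `value_net`, `fit_value_function`, and `bootstrapped_targets` sketches above. It assumes a Gymnasium-style environment with a discrete action space; all names and hyperparameters are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2  # assumed environment dimensions
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def actor_critic_iteration(env, batch_size=1000):
    # 1: sample {s_i, a_i} from pi_theta by running it in the environment
    states, actions, rewards, next_states, dones = [], [], [], [], []
    s, _ = env.reset()
    for _ in range(batch_size):
        dist = Categorical(logits=policy_net(torch.as_tensor(s, dtype=torch.float32)))
        a = dist.sample().item()
        s_next, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(float(r))
        next_states.append(s_next); dones.append(float(terminated))
        s = env.reset()[0] if (terminated or truncated) else s_next

    states_t = torch.as_tensor(np.array(states), dtype=torch.float32)
    next_states_t = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    rewards_t = torch.as_tensor(rewards, dtype=torch.float32)
    dones_t = torch.as_tensor(dones, dtype=torch.float32)
    actions_t = torch.as_tensor(actions)

    # 2: fit V_hat_phi, here using the bootstrapped targets
    targets = bootstrapped_targets(value_net, rewards_t, next_states_t, dones_t)
    fit_value_function(states_t, targets)

    # 3: evaluate A_hat(s_i, a_i) = r(s_i, a_i) + V_hat(s_i') - V_hat(s_i)
    with torch.no_grad():
        advantages = targets - value_net(states_t).squeeze(-1)

    # 4 and 5: policy gradient estimate and gradient step on theta
    log_probs = Categorical(logits=policy_net(states_t)).log_prob(actions_t)
    policy_loss = -(log_probs * advantages).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```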
