Evaluate the PG

From the last chapter, the goal of RL is

$$\theta^\star=\arg\max_{\theta} \; E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$

and we denote $J(\theta)$ as

$$J(\theta)=E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$

which is called the objective.

Evaluating the objective

Use the Monte Carlo method to draw samples and estimate $J(\theta)$:

$$J(\theta)\approx \frac{1}{N}\sum_i\sum_t r(s_{i,t},a_{i,t})$$

Every sample (each $i$) is a trajectory over time $t$ generated by running $\pi_\theta$.
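
As a concrete illustration (not part of the lecture), here is a minimal Python sketch of this Monte Carlo estimate. It assumes a Gymnasium-style environment with a `reset()`/`step()` interface and a hypothetical `policy` callable that samples $a_t\sim\pi_\theta(a_t|s_t)$; the trajectory count and horizon are arbitrary.

```python
import numpy as np

def estimate_objective(env, policy, num_trajectories=20, horizon=200):
    """Monte Carlo estimate of J(theta): average total reward over sampled trajectories.

    `env` is assumed to expose a Gymnasium-style reset()/step() interface, and
    `policy(obs)` is assumed to return an action sampled from pi_theta(a|s).
    """
    returns = []
    for _ in range(num_trajectories):
        obs, _ = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = policy(obs)                      # a_t ~ pi_theta(a_t | s_t)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward                    # accumulate sum_t r(s_t, a_t)
            if terminated or truncated:
                break
        returns.append(total_reward)
    # J(theta) ~ (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})
    return np.mean(returns)
```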

Direct policy gradient

We need to improve the objective $J(\theta)$, so we can simply take its derivative and use gradient ascent. Denote $r(\tau)=\sum_{t=1}^T r(s_t,a_t)$; then

$$J(\theta)=E_{\tau\sim \pi_\theta(\tau)}[r(\tau)]=\int \pi_\theta(\tau)\,r(\tau)\,d\tau$$

So the derivative of $J(\theta)$ with respect to $\theta$ is

$$\nabla_\theta J(\theta) =\int \nabla_\theta \pi_\theta(\tau)\,r(\tau)\,d\tau =\int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,r(\tau)\,d\tau =E_{\tau\sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\,r(\tau)\right]$$

The second equality uses a convenient identity (the log-derivative trick):

$$\pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) =\pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} =\nabla_\theta \pi_\theta(\tau)$$

From the above, $J(\theta)$ is the expectation of $r(\tau)$, and $\nabla_\theta J(\theta)$ is the expectation of $\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)$ (that is, $r(\tau)$ weighted by $\nabla_\theta \log \pi_\theta(\tau)$), where $\tau$ follows $\pi_\theta(\tau)$.

We already know $r(\tau)=\sum_{t=1}^T r(s_t,a_t)$, but what is $\nabla_\theta \log \pi_\theta(\tau)$?

$$\begin{aligned} \pi_\theta(\tau)&= p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t) \\ \log \pi_\theta(\tau) &= \log p(s_1) +\sum_{t=1}^T \left[\log\pi_\theta(a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\right]\\ \nabla_\theta \log \pi_\theta(\tau) &= \nabla_\theta \left[\log p(s_1) +\sum_{t=1}^T \left[\log\pi_\theta(a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\right] \right] \\ &=\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t|s_t) \end{aligned}$$

The last equality holds because the initial-state and transition terms do not depend on $\theta$, so their gradients vanish. It is also worth mentioning that the log trick turns the product $\prod$ into a sum $\sum$, which is friendly to $\nabla$ and easy to estimate.

Finally,

$$\nabla_\theta J(\theta) =E_{\tau\sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]$$

Notice that this expression requires neither the transition probabilities nor the initial-state distribution: we can simply sample from the environment without knowing the dynamics of the system. Moreover, the policy $\pi_\theta$ can be any parameterized distribution we choose.

Evaluating the policy gradient

In practice, we can use $N$ trajectory samples and take the average:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)$$

Naturally, we obtain the following algorithm.

REINFORCE algorithm:

Repeat until convergence:

1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run the current policy).

2. Estimate the gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t^i|s_t^i)\right)\left(\sum_{t=1}^T r(s_t^i,a_t^i)\right)$

3. Update the parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
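
Below is a minimal PyTorch sketch of this loop, assuming Gymnasium's CartPole-v1, an illustrative two-layer softmax policy, and Adam in place of plain gradient ascent (the hyperparameters are not from the lecture). The key point is that step 2 is implemented by building a surrogate "pseudo-loss" $-\frac{1}{N}\sum_{i=1}^N \big(\sum_t \log\pi_\theta(a_t^i|s_t^i)\big)\big(\sum_t r(s_t^i,a_t^i)\big)$ whose autograd gradient is exactly $-\nabla_\theta J(\theta)$, so minimizing it is gradient ascent on $J(\theta)$.

```python
import torch
import torch.nn as nn
import gymnasium as gym

# Illustrative setup: a small softmax policy for CartPole-v1 (discrete actions).
env = gym.make("CartPole-v1")
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

N = 16  # trajectories per gradient step

for _ in range(200):                     # "repeat until convergence"
    log_prob_sums, returns = [], []
    for _ in range(N):                   # step 1: sample {tau^i} from pi_theta
        obs, _ = env.reset()
        log_probs, total_reward, done = [], 0.0, False
        while not done:
            logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))   # log pi_theta(a_t | s_t)
            obs, reward, terminated, truncated, _ = env.step(action.item())
            total_reward += reward
            done = terminated or truncated
        log_prob_sums.append(torch.stack(log_probs).sum())
        returns.append(total_reward)

    # Step 2: surrogate ("pseudo") loss whose gradient equals -grad_theta J(theta).
    returns_t = torch.as_tensor(returns, dtype=torch.float32)
    loss = -(torch.stack(log_prob_sums) * returns_t).mean()

    # Step 3: gradient step (Adam standing in for plain gradient ascent).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```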

Continuous actions: Gaussian policies

For continuous actions, we can use a Gaussian policy:

$$\begin{aligned} \pi_\theta(a_t|s_t) &= \mathcal{N}\big(f_{\text{neural network}}(s_t),\,\Sigma\big) \\ \log \pi_\theta(a_t|s_t) &= -\frac{1}{2}\,\|f(s_t)-a_t\|^2_\Sigma + \text{const} \\ \nabla_\theta \log \pi_\theta(a_t|s_t) &= -\Sigma^{-1}\big(f(s_t)-a_t\big)\,\frac{df}{d\theta} \end{aligned}$$
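
As a sketch (not from the lecture), such a Gaussian policy head can be written in PyTorch with `torch.distributions.Normal`; the hidden size and the state-independent diagonal $\Sigma$ (a learnable `log_std`) are assumptions for illustration. Autograd then supplies $\nabla_\theta \log\pi_\theta(a_t|s_t)$, so the analytic gradient above never needs to be coded by hand.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a_t | s_t) = N(f_theta(s_t), Sigma) with a diagonal, state-independent Sigma."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        # f_{neural network}(s_t): outputs the mean of the Gaussian.
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # Learnable log standard deviations (diagonal Sigma).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def distribution(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

    def log_prob(self, obs, action):
        # log pi_theta(a_t | s_t): sum over action dimensions of the diagonal Gaussian.
        return self.distribution(obs).log_prob(action).sum(dim=-1)
```

Sampling an action with `policy.distribution(obs).sample()` and summing `log_prob` over a trajectory plugs directly into the same pseudo-loss used in the discrete REINFORCE sketch above.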

Partial observability

When the agent receives observations $o_t$ rather than the full state $s_t$, we simply condition the policy on the observation:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_{i,t}|o_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)$$

Notice that the Markov property is never actually used in the derivation, so we can apply the policy gradient to partially observed MDPs without modification.
