Variance is too high!
Very bad sample efficiency.

The randomness of the policy's trajectory distribution can be quite significant!
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \underbrace{\hat{Q}_{i,t}^{\pi}}_{\text{"reward to go"}} $$
*Here $\hat{Q}_{i,t}^{\pi}$ denotes a single-sample estimate of the expected reward when taking action $a_{i,t}$ in state $s_{i,t}$.
The reward-to-go (cumulative "future" reward) is measured from just one trajectory, i.e., it is a single-sample estimate of a VERY COMPLEX expectation.
→ it does not represent the expectation $\mathbb{E}$ (the average) of the value, but rather only ONE possible outcome, because it is a single-sample estimate.
→ this introduces high variance, because the reward value differs depending on which sample happens to be drawn (see the sketch below).
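Below is a minimal Python sketch (not from the original notes; the toy dynamics, noise scales, and all names are hypothetical) that draws many single-trajectory reward-to-go estimates and prints their spread, illustrating how far one sample of $\hat{Q}_{i,t}^{\pi}$ can land from its expectation.

```python
# Illustration only: how much single-sample reward-to-go estimates vary.
# The environment and "policy" below are made up for the sake of the demo.
import numpy as np

rng = np.random.default_rng(0)
T = 10            # horizon
n_samples = 1000  # number of independent single-trajectory estimates

def sample_reward_to_go():
    """Roll out one trajectory and return the sum of rewards from t=0 to T-1."""
    s = 0.0
    total = 0.0
    for t in range(T):
        a = rng.normal()                 # stand-in for an action drawn from pi_theta
        r = s + a + rng.normal(scale=2)  # noisy, state-dependent reward
        s = 0.9 * s + 0.1 * a            # simple state transition
        total += r
    return total

estimates = np.array([sample_reward_to_go() for _ in range(n_samples)])
print("mean of Q-hat estimates:", estimates.mean())
print("std of a single-sample Q-hat:", estimates.std())  # large spread = high variance
```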
To restate this…
Policy gradient updates the policy to a better one by performing gradient ascent on the objective function.
Written as an equation:
$$ \nabla_\theta J(\theta) \approx \\ \frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{t=1}^{T}{\nabla_\theta \log{\pi_\theta(a_{i,t}|s_{i,t})}} \left(\displaystyle\sum_{t'=t}^{T}{r(s_{i,t'},a_{i,t'})} \right) \\ = \frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{t=1}^{T}{\nabla_\theta \log{\pi_\theta(a_{i,t}|s_{i,t})}} \hat{Q}_{i,t} $$
In other words, we can see that it relies on a single-sample estimate of the Q function.
A single-sample estimate of the Q function means we use just one sample to estimate $Q(s_t, a_t)$, and its value can differ from sample to sample = high variance problem (see the sketch below).
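For concreteness, here is a minimal Python sketch of the estimator above under assumed conditions: a linear-softmax policy over random features, placeholder rewards, and $N$ pre-sampled trajectories. Every name and detail is illustrative, not the original implementation; the point is only that each $\hat{Q}_{i,t}$ is the reward-to-go of a single rollout.

```python
# Sketch of grad J ≈ (1/N) sum_i sum_t grad log pi(a_{i,t}|s_{i,t}) * Q_hat_{i,t},
# where Q_hat_{i,t} = sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) from one trajectory.
# Policy, features, and rewards below are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, T, N = 4, 3, 5, 8
theta = rng.normal(size=(n_features, n_actions))

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(phi, a, probs):
    # For a linear-softmax policy, d/d theta log pi(a|s) = outer(phi, onehot(a) - probs).
    onehot = np.zeros_like(probs)
    onehot[a] = 1.0
    return np.outer(phi, onehot - probs)

grad = np.zeros_like(theta)
for i in range(N):                        # N sampled trajectories
    phis = rng.normal(size=(T, n_features))   # random features stand in for states
    actions, rewards, probs_list = [], [], []
    for t in range(T):
        probs = softmax(phis[t] @ theta)
        a = rng.choice(n_actions, p=probs)
        actions.append(a)
        probs_list.append(probs)
        rewards.append(rng.normal())      # placeholder reward r(s_t, a_t)
    rewards = np.array(rewards)
    for t in range(T):
        q_hat = rewards[t:].sum()         # single-sample reward-to-go Q_hat_{i,t}
        grad += grad_log_pi(phis[t], actions[t], probs_list[t]) * q_hat
grad /= N
print("policy gradient estimate shape:", grad.shape)
```

Because `q_hat` comes from one rollout, rerunning the sampling gives a noticeably different `grad`, which is exactly the high-variance problem described above.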