Variance is too high!
Very bad sample efficiency.

The randomness of the policy’s trajectory distribution can be quite significant!
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \underbrace{\hat{Q}_{i,t}^{\pi}}_{\text{"reward to go"}} $$
Here $\hat{Q}_{i,t}^{\pi}$ means a single-sample estimate of the expected reward when we take action $\mathbf{a}_{i,t}$ in state $\mathbf{s}_{i,t}$.
Reward-to-go (the cumulative "future" reward) is measured from just one trajectory, i.e. a single-sample estimate of a VERY COMPLEX expectation.
$→$ It does not represent the expectation $\mathbb{E}$ (average) of the value, but rather only ONE possibility, because it is a single-sample estimate.
$→$ This introduces high variance, because the estimated reward differs depending on which sample happened to be drawn. A minimal sketch of this estimator is given below.
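A minimal sketch of the estimator above, assuming a hypothetical `grad_log_pi(s, a)` helper that returns $\nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})$ as a flat vector and `trajectories` given as `(states, actions, rewards)` tuples (these names are illustrative, not from the notes):

```python
import numpy as np

def reward_to_go(rewards):
    # hat{Q}_{i,t}: cumulative future reward from step t onward, computed
    # from a single sampled trajectory (this is the high-variance part).
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def policy_gradient_estimate(trajectories, grad_log_pi):
    # trajectories: list of (states, actions, rewards), one per sampled rollout
    # grad_log_pi(s, a): assumed helper returning grad_theta log pi_theta(a|s)
    grads = []
    for states, actions, rewards in trajectories:
        rtg = reward_to_go(rewards)
        for s, a, q_hat in zip(states, actions, rtg):
            grads.append(q_hat * grad_log_pi(s, a))
    # 1/N * sum_i sum_t grad log pi * reward-to-go
    return np.sum(grads, axis=0) / len(trajectories)
```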
Idea: Actor-Critic Algorithm. Instead of using the single-sample reward-to-go, one way to solve this is to use a value function, the Q-function $Q^{\pi}(\mathbf{s}_t, \mathbf{a}_t)$: the expected total reward from taking action $\mathbf{a}_t$ in state $\mathbf{s}_t$ and then following $\pi_\theta$.
The value function $V^{\pi}(\mathbf{s}_t)$ is the total reward expected starting from a certain state $\mathbf{s}_t$ when following the policy $\pi_\theta$. This works as a baseline. It is more sophisticated than $r(\tau) - b$ because it takes into account which state we are in.
The advantage $A^{\pi}(\mathbf{s}_t, \mathbf{a}_t) = Q^{\pi}(\mathbf{s}_t, \mathbf{a}_t) - V^{\pi}(\mathbf{s}_t)$ measures how advantageous a certain action is compared to the baseline value (calculated from $V^{\pi}(\mathbf{s}_t)$).
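A hedged sketch of how the critic plugs into the gradient: the noisy $\hat{Q}_{i,t}^{\pi}$ from before is replaced by the advantage estimate $A^{\pi}(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})$. Here `q_fn` and `v_fn` stand for learned approximators (e.g. neural networks) and `grad_log_pi` is the same hypothetical helper as above; none of these names come from the notes.

```python
import numpy as np

def advantage(q_fn, v_fn, s, a):
    # A^pi(s, a) = Q^pi(s, a) - V^pi(s): how much better action a is
    # than the policy's average behavior in state s.
    return q_fn(s, a) - v_fn(s)

def actor_critic_gradient(trajectories, grad_log_pi, q_fn, v_fn):
    # Same outer structure as the policy-gradient estimator, but the
    # single-sample reward-to-go is replaced by the critic's advantage
    # estimate, which lowers the variance of the gradient estimate.
    grads = []
    for states, actions, _rewards in trajectories:
        for s, a in zip(states, actions):
            grads.append(advantage(q_fn, v_fn, s, a) * grad_log_pi(s, a))
    return np.sum(grads, axis=0) / len(trajectories)
```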