The Problem of Policy Gradient

Variance is too high!

Very bad sample efficiency.

The randomness of the policy’s trajectory distribution can be quite significant!

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \underbrace{\hat{Q}_{i,t}^{\pi}}_{\text{"reward to go"}} $$

Here $\hat{Q}_{i,t}^{\pi}$ means a single-sample estimate of the expected reward when we take action $a_{i,t}$ in state $s_{i,t}$.

The reward-to-go (cumulative “future” reward) is measured from just one trajectory: a single-sample estimate of a VERY COMPLEX expectation.

$→$ it doesn’t represent the $\mathbb{E}$ (average) of the value, but only ONE possible outcome, because it’s a single-sample estimate.

$→$ this introduces high variance, because the estimated reward differs depending on which sample happens to be drawn (see the sketch below).
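To make the high-variance point concrete, here is a minimal sketch; the toy reward distribution, horizon `T`, and helper names are assumptions for illustration, not part of these notes. It computes the single-sample reward-to-go $\hat{Q}_{i,t}$ from one rollout and shows how much that estimate fluctuates across repeated rollouts of the same situation.

```python
# Minimal sketch (toy setup assumed): how noisy is a single-sample reward-to-go?
import numpy as np

rng = np.random.default_rng(0)
T = 10  # horizon (hypothetical)

def rollout_rewards():
    # Hypothetical stochastic environment/policy: each step's reward is noisy.
    return rng.normal(loc=1.0, scale=2.0, size=T)

def reward_to_go(rewards, t):
    # Q_hat_t = sum_{t'=t}^{T} r_{t'}  -- computed from ONE trajectory only.
    return rewards[t:].sum()

# Repeat the same situation many times: each rollout gives a different
# single-sample estimate of the same expected reward-to-go.
samples = np.array([reward_to_go(rollout_rewards(), t=0) for _ in range(1000)])
print("mean of Q_hat:", samples.mean())  # close to the true expectation (T * 1.0)
print("std  of Q_hat:", samples.std())   # large spread: the high-variance problem
```

Averaging many samples recovers the expectation, but a policy-gradient step only gets to use the few trajectories it sampled, so each $\hat{Q}_{i,t}$ it plugs in is one of these noisy draws.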

To explain this again…

Policy gradient updates toward a better policy by performing gradient ascent on the objective function.

Expressed as an equation:

$$ \begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{t=1}^{T}{\nabla_\theta \log{\pi_\theta(a_{i,t}|s_{i,t})}} \left(\displaystyle\sum_{t'=t}^{T}{r(s_{i,t'},a_{i,t'})} \right) \\ &= \frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{t=1}^{T}{\nabla_\theta \log{\pi_\theta(a_{i,t}|s_{i,t})}}\, \hat{Q}_{i,t} \end{aligned} $$

In other words, the gradient relies on a single-sample estimate of the Q function.

A single-sample estimate of the Q function means we use only one sample to estimate $Q(s_t, a_t)$, and its value can change depending on which sample is drawn = the high-variance problem (a sketch of the estimator follows below).
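As a rough sketch of the estimator above, the code below assumes a hypothetical tabular softmax policy and a made-up stochastic environment (`step`, `n_states`, `n_actions`, and the reward model are illustrative choices, not from these notes). It forms $\frac{1}{N}\sum_i\sum_t \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t}$ with the single-sample reward-to-go, so different random seeds give noticeably different gradient estimates.

```python
# Minimal sketch (toy MDP assumed): the policy-gradient estimate
#   (1/N) * sum_i sum_t grad log pi(a|s) * Q_hat_{i,t}
# where Q_hat_{i,t} is the single-sample reward-to-go from one trajectory.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T, N = 3, 2, 5, 8   # hypothetical sizes
theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    # Gradient of log pi(a|s) for a tabular softmax: one-hot(a) - pi(.|s) in row s.
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

def step(s, a):
    # Hypothetical stochastic dynamics and reward.
    return rng.integers(n_states), rng.normal(loc=float(a), scale=1.0)

grad = np.zeros_like(theta)
for _ in range(N):                        # N sampled trajectories
    s = rng.integers(n_states)
    traj = []
    for _ in range(T):
        a = rng.choice(n_actions, p=policy(s))
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    rewards = np.array([r for _, _, r in traj])
    for t, (s_t, a_t, _) in enumerate(traj):
        q_hat = rewards[t:].sum()         # single-sample reward-to-go Q_hat_{i,t}
        grad += grad_log_pi(s_t, a_t) * q_hat
grad /= N
print(grad)  # re-running with another seed gives a noticeably different estimate
```

Because each $\hat{Q}_{i,t}$ comes from just one rollout, the whole gradient estimate inherits that noise; this is the variance that later tricks (baselines, critics) are designed to reduce.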