The Problem with Policy Gradients

Variance is too high!

Very bad sample efficiency.

The randomness of the policy’s trajectory distribution can be quite significant!

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \underbrace{\hat{Q}_{i,t}^{\pi}}_{\text{"reward to go"}} $$

*Here $\hat{Q}_{i,t}^{\pi}$ denotes a single-sample estimate of the expected reward when we take action $\mathbf{a}_{i,t}$ in state $\mathbf{s}_{i,t}$.

Reward-to-go (the cumulative “future” reward) is estimated from just ONE trajectory (a single-sample estimate) of a very complex expectation.

$→$ it does not represent the expectation $\mathbb{E}$ (average) of the value; being a single-sample estimate, it captures only ONE possible outcome.

$→$ this introduces high variance, because the estimated reward changes depending on which trajectory happens to be sampled.
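A minimal NumPy sketch of this single-sample estimate (the trajectories and the discount factor $\gamma = 0.99$ are made-up assumptions, not values from the notes). Two rollouts that start the same way can disagree wildly about the reward-to-go of the same early steps, which is exactly the variance problem described above.

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    # Q_hat_{i,t}: discounted sum of rewards from step t to the end of ONE trajectory
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

# Two rollouts through the same early states can yield very different estimates.
traj_a = [1.0, 0.0, 0.0, 5.0]
traj_b = [1.0, 0.0, 0.0, -3.0]
print(reward_to_go(traj_a))   # approx [ 5.85,  4.90,  4.95,  5.0]
print(reward_to_go(traj_b))   # approx [-1.91, -2.94, -2.97, -3.0]
```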

Idea: Actor-Critic Algorithm. Instead of using the single-sample reward-to-go, estimate the expected reward with a learned value function (the Q-function).

Terminology

State-Value Function $V^\pi(\mathbf{s}_t) = E_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}[Q^\pi(\mathbf{s}_t, \mathbf{a}_t)]$

The total expected reward starting from a certain state $\mathbf{s}_t$ when following the policy $\pi_\theta$. This works as a baseline, and it is more sophisticated than the constant baseline in $r(\tau) - b$ because it depends on which state the agent is in.
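A hedged sketch of how a state-dependent baseline $V^\pi(\mathbf{s}_t)$ plugs into the gradient estimate above. Here `value_fn` is a hypothetical learned critic and `log_prob_grads` are precomputed $\nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ vectors; neither is defined in these notes.

```python
import numpy as np

def pg_with_state_baseline(log_prob_grads, states, q_hats, value_fn):
    """Average of grad log pi(a_t | s_t) * (Q_hat_t - V(s_t)) over sampled steps.

    log_prob_grads: list of gradient vectors, one per sampled (s_t, a_t)
    states:         the corresponding states s_t
    q_hats:         reward-to-go estimates Q_hat_t (see the sketch above)
    value_fn:       callable s -> V(s), the learned state-value baseline
    """
    weighted = [g * (q - value_fn(s))
                for g, s, q in zip(log_prob_grads, states, q_hats)]
    return np.mean(weighted, axis=0)
```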

Advantage Function $A^\pi(\mathbf{s}_t, \mathbf{a}_t)$

$A^\pi(\mathbf{s}_t, \mathbf{a}_t) = Q^\pi(\mathbf{s}_t, \mathbf{a}_t) - V^\pi(\mathbf{s}_t)$: measures how advantageous a certain action is compared to the baseline value $V^\pi(\mathbf{s}_t)$.
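A small sketch of the usual one-step (bootstrapped) advantage estimate used in actor-critic methods, $A^\pi(\mathbf{s}_t, \mathbf{a}_t) \approx r(\mathbf{s}_t, \mathbf{a}_t) + \gamma V^\pi(\mathbf{s}_{t+1}) - V^\pi(\mathbf{s}_t)$. Again, `value_fn` is a hypothetical critic and $\gamma$ is an assumed discount factor.

```python
def advantage_estimate(reward, state, next_state, value_fn, gamma=0.99, done=False):
    """One-step advantage: positive when the action did better than the critic expected."""
    bootstrap = 0.0 if done else gamma * value_fn(next_state)
    return reward + bootstrap - value_fn(state)
```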