<aside> 💡
The main goal is to maximize $J(\pi_\theta)= E_{\tau \sim \pi_\theta}[R(\tau)]$, i.e., to obtain the highest expected return.
To put it more simply, the main goal of all the math below is to answer the question "How should I change my policy's parameters $\theta$ to get more reward?"
</aside>
We can improve the policy by gradient ascent on its parameters:

$$ \theta_{k+1} = \theta_k + \alpha\nabla_\theta J(\pi_\theta)|_{\theta_k}, $$

where $\alpha$ is the step size and $\nabla_\theta J(\pi_\theta)$ is the policy gradient.
Algorithms that optimize the policy this way are called policy gradient algorithms; examples include Vanilla Policy Gradient and TRPO.
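To make the update rule concrete, here is a minimal sketch of one gradient ascent step, assuming we already have some estimate `grad_J` of $\nabla_\theta J(\pi_\theta)$ (the function name, the placeholder values, and the step size are illustrative, not from the original text):

```python
import numpy as np

def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """One ascent step: theta_{k+1} = theta_k + alpha * grad_J.

    grad_J is assumed to be an estimate of nabla_theta J(pi_theta)
    evaluated at theta, e.g. computed from sampled trajectories.
    """
    return theta + alpha * grad_J

# Hypothetical usage with a flat parameter vector for the policy.
theta = np.zeros(4)                               # initial policy parameters
grad_estimate = np.array([0.5, -0.1, 0.2, 0.0])   # placeholder gradient estimate
theta = gradient_ascent_step(theta, grad_estimate, alpha=0.05)
print(theta)
```

The hard part, of course, is obtaining `grad_J` in the first place, which is what the rest of the derivation addresses.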
To actually use this algorithm, we need an expression for the policy gradient which we can numerically compute.
This involves two steps: