<aside> 💡
The main goal is to maximize $J(\pi_\theta)= E_{\tau \sim \pi_\theta}[R(\tau)]$, i.e., to obtain the highest expected return.
To put it more simply, the main goal of all the math below is to answer the question "How should I change my policy's parameters $\theta$ to get more reward?"
</aside>
We can improve the policy by gradient ascent on its parameters:

$$ \theta_{k+1} = \theta_k + \alpha\nabla_\theta J(\pi_\theta)|_{\theta_k}, $$

where $\alpha$ is the step size and $\nabla_\theta J(\pi_\theta)$ is the policy gradient.
Algorithms that optimize the policy this way are called policy gradient algorithms; examples include Vanilla Policy Gradient and TRPO.
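To make the update rule concrete, here is a minimal sketch of one gradient ascent step, assuming we already have some estimate `grad_J` of $\nabla_\theta J(\pi_\theta)$ (the function name, the placeholder values, and the step size are illustrative, not from the original text):

```python
import numpy as np

def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """One ascent step: theta_{k+1} = theta_k + alpha * grad_J.

    grad_J is assumed to be an estimate of nabla_theta J(pi_theta)
    evaluated at theta, e.g. computed from sampled trajectories.
    """
    return theta + alpha * grad_J

# Hypothetical usage with a flat parameter vector for the policy.
theta = np.zeros(4)                               # initial policy parameters
grad_estimate = np.array([0.5, -0.1, 0.2, 0.0])   # placeholder gradient estimate
theta = gradient_ascent_step(theta, grad_estimate, alpha=0.05)
print(theta)
```

The hard part, of course, is obtaining `grad_J` in the first place, which is what the rest of the derivation addresses.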
To actually use this algorithm, we need an expression for the policy gradient which we can numerically compute.
This involves two steps: