As previously explained, the objective function is the expected cumulative reward from $t=1$ to $t=T$, and the goal of RL is to maximize this objective by tuning the parameters of the policy.
$$ \theta^* = \underset{\theta}{\arg\max} \underbrace{\mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \right]}_{J(\theta)} $$
In model-free algorithms, where we don't know the state transition function, we need to run episodes to estimate these values.
The most intuitive way to do this is to run the policy (i.e., sample trajectories) on the environment a number of times and average the returns over the trials. This method is called Monte Carlo estimation and is mathematically described as
$$ J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)\right] \approx \frac{1}{N} \sum_{i} \sum_{t} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) $$
where $i$ is the index of the sampled trajectory and $t$ is, as before, the time step of each state/action pair.
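As a concrete illustration, here is a minimal sketch of this estimator on a made-up one-dimensional environment with a Gaussian policy. The environment, policy, and all names (`policy`, `step`, `rollout`, `estimate_J`) are assumptions for this example only, not something defined in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state, theta):
    """Sample an action from a toy Gaussian policy pi_theta(a | s) with mean theta * s."""
    return rng.normal(loc=theta * state, scale=1.0)

def step(state, action):
    """Toy dynamics: the state drifts with the action; the reward penalizes distance from 0."""
    next_state = state + 0.1 * action + 0.01 * rng.normal()
    reward = -(next_state ** 2)
    return next_state, reward

def rollout(theta, T=50):
    """Run one episode and return its total reward, sum_t r(s_t, a_t)."""
    state, total = rng.normal(), 0.0
    for _ in range(T):
        action = policy(state, theta)
        state, reward = step(state, action)
        total += reward
    return total

def estimate_J(theta, N=1000):
    """Monte Carlo estimate: J(theta) ~= (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})."""
    return np.mean([rollout(theta) for _ in range(N)])

print(estimate_J(theta=-0.5))
```

Nothing about the estimator depends on this particular toy setup: any environment we can sample episodes from gives us the same average-of-returns approximation of $J(\theta)$.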
We can use the same method (Monte Carlo estimation) to estimate the gradient as well.
$$ J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} [r(\tau)] = \int p_{\theta}(\tau)r(\tau)\,d\tau \quad \text{where} \quad r(\tau) = \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) $$
If we take the gradient of both sides, we get
$$ \nabla_{\theta}J(\theta) = \nabla_{\theta} \int p_{\theta}(\tau)r(\tau)d\tau = \int \nabla_{\theta} p_{\theta}(\tau)r(\tau)d\tau $$
By using a mathematical trick known as the log-derivative identity, we can express $\nabla_{\theta} p_{\theta}(\tau)$ as
$$ p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) = p_{\theta}(\tau) \frac{\nabla_{\theta} p_{\theta}(\tau)}{p_{\theta}(\tau)} = \nabla_{\theta} p_{\theta}(\tau) $$
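Substituting this identity into the integral above turns $\nabla_{\theta}J(\theta)$ into an expectation of $\nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)$, which can again be estimated by sampling. The sketch below checks this numerically on a one-dimensional toy problem: a Gaussian $\mathcal{N}(x;\theta,1)$ stands in for $p_{\theta}(\tau)$ and $r(x) = x^2$ stands in for the return, both of which are illustrative assumptions rather than anything from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration):
#   p_theta(x) = N(x; theta, 1)  and  r(x) = x**2.
# Analytically, J(theta) = E[x^2] = theta^2 + 1, so grad_theta J = 2 * theta.
# The score-function estimate is (1/N) * sum_i grad_theta log p_theta(x_i) * r(x_i),
# where grad_theta log N(x; theta, 1) = (x - theta).

theta = 0.7
N = 200_000

x = rng.normal(loc=theta, scale=1.0, size=N)   # samples x_i ~ p_theta
score = x - theta                               # grad_theta log p_theta(x_i)
reward = x ** 2                                 # r(x_i)

grad_estimate = np.mean(score * reward)
print(grad_estimate)   # close to 1.4
print(2 * theta)       # exact gradient: 1.4
```

With enough samples the estimate converges to the exact gradient $2\theta$, even though we never differentiated through the sampling process itself.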