Definitions

$$ \begin{align*}\mathbf{s}_t & \quad \text{-- state} & \pi_\theta(\mathbf{a}_t|\mathbf{o}_t) & \quad \text{-- policy} \\\mathbf{o}_t & \quad \text{-- observation} & \pi_\theta(\mathbf{a}_t|\mathbf{s}_t) & \quad \text{-- policy (fully observed)} \\\mathbf{a}_t & \quad \text{-- action}\end{align*} $$

$t$ denotes the time step, and $\theta$ denotes the learnable parameters of the agent’s policy.
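As a concrete illustration (a minimal sketch, not from the notes above), a discrete-action policy $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$ can be parameterized as a softmax over a linear function of the state; the weight matrix plays the role of $\theta$. The state and action dimensions below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dimensional state, 3 discrete actions.
STATE_DIM, NUM_ACTIONS = 4, 3

# theta: the learnable parameters of the policy (here, one weight matrix).
theta = rng.normal(size=(STATE_DIM, NUM_ACTIONS))

def policy(s_t: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """pi_theta(a_t | s_t): softmax distribution over discrete actions."""
    logits = s_t @ theta
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

s_t = rng.normal(size=STATE_DIM)          # current state
probs = policy(s_t, theta)                # action distribution pi_theta(. | s_t)
a_t = rng.choice(NUM_ACTIONS, p=probs)    # sample an action a_t ~ pi_theta
print(probs, a_t)
```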

Reward Functions

The reward function $r(s,a)$ is the criterion by which we update the policy $\pi_\theta$: it tells us which states and actions are better.

Together, $s$, $a$, $r(s,a)$, and $p(s'|s,a)$ define a Markov decision process,

where $p(s'|s,a)$ is the probability of transitioning to state $s'$ given the current state $s$ and action $a$.
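A minimal tabular sketch of these ingredients (the states, actions, rewards, and transition probabilities below are made-up numbers for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state, two-action MDP.
states  = ["s0", "s1"]
actions = ["left", "right"]

# r(s, a): reward for taking action a in state s (made-up numbers).
reward = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
          ("s1", "left"): 1.0, ("s1", "right"): 0.0}

# p(s' | s, a): probability of landing in s' given s and a (made-up numbers).
transition = {("s0", "left"):  {"s0": 0.9, "s1": 0.1},
              ("s0", "right"): {"s0": 0.2, "s1": 0.8},
              ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
              ("s1", "right"): {"s0": 0.1, "s1": 0.9}}

def step(s, a):
    """Sample s' ~ p(.|s, a) and return (s', r(s, a))."""
    next_states = list(transition[(s, a)].keys())
    probs = list(transition[(s, a)].values())
    s_next = rng.choice(next_states, p=probs)
    return s_next, reward[(s, a)]

print(step("s0", "right"))  # e.g. ('s1', 1.0)
```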

Markov Property


The Markov property says that predicting the next state $s_{t+1}$ requires only the current state $s_t$ and current action $a_t$; the prediction is independent of the earlier states $s_1, \dots, s_{t-1}$.
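In symbols (a standard statement of the property, written in the notation defined above), the transition distribution conditions only on the most recent state and action:

$$ p(\mathbf{s}_{t+1} \mid \mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_t, \mathbf{a}_t) = p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) $$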

Markov Chain

A Markov chain defines the probability of moving from one state to another.

$$ \mathcal{M} = \{\mathcal{S}, \mathcal{T}\} $$
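Here $\mathcal{S}$ is the state space and $\mathcal{T}$ is the transition operator (assuming the usual convention). For a discrete chain, $\mathcal{T}$ is just a matrix with entries $\mathcal{T}_{ij} = p(s_{t+1}=i \mid s_t=j)$, and the state distribution evolves as $\mu_{t+1} = \mathcal{T}\mu_t$. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical 3-state Markov chain. Column j holds p(s' | s = j),
# so each column sums to 1 and mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.3],
              [0.0, 0.1, 0.7]])

mu = np.array([1.0, 0.0, 0.0])  # start in state 0 with probability 1

for t in range(50):
    mu = T @ mu                 # propagate the state distribution one step

print(mu)  # approaches the chain's stationary distribution
```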