RL in a Nutshell

In one sentence, Reinforcement Learning (or RL) is finding the optimal sequence of choices under uncertainty when the reward is delayed.

Unlike Supervised Learning, where labels exist, RL is about learning through interaction and experience:

$$ \text{State } s_0 \to \text{Action } a_0 \to \text{Reward } r_0 \to \text{State } s_1 \to \dots $$
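Conceptually, that chain is produced by an interaction loop. Below is a minimal sketch of such a loop, assuming a Gymnasium-style `reset()`/`step()` interface; `env` and `policy` are placeholder names, not part of any specific library discussed here.

```python
def collect_trajectory(env, policy, max_steps=100):
    """Roll out one episode and record the chain (s_t, a_t, r_t)."""
    trajectory = []
    state, _ = env.reset()                                  # State s_0
    for _ in range(max_steps):
        action = policy(state)                              # Action a_t
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))          # (s_t, a_t, r_t)
        state = next_state                                  # move on to State s_{t+1}
        if terminated or truncated:
            break
    return trajectory
```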

The Problem Setup

The definition above is very abstract. When actually solving an RL problem mathematically, we model the world as a Markov Decision Process (MDP for short). An MDP assumes the Markov Property, which means the future depends only on the current state, not on the history of how we got there.
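To make the MDP idea concrete, here is a toy MDP sketched as plain Python dictionaries; every state, action, probability, and reward below is invented for illustration. The Markov property is visible in the structure: sampling the next state needs only the current state and action, never the history.

```python
import random

P = {   # P[s][a] = [(probability, next_state), ...]
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {   # R[s][a] = immediate reward r(s, a)
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
gamma = 0.9   # discount factor

def step(s, a):
    """One environment step: depends only on (s, a), not on how we reached s."""
    probs, next_states = zip(*P[s][a])
    return random.choices(next_states, weights=probs)[0], R[s][a]
```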

While we assume we know the immediate reward $r$, the core difficulty of RL is estimating the return $G_t$.

<aside> 💡

How does the word “reward” differ from “return” in RL?

Reward $r$ is a local, immediate value you get from a certain state and action. It’s often expressed as $r(s,a)$. It is mainly what appears when we write the objective function in Bellman equation form.

The return $G_t$, on the other hand, is the global quantity: the accumulated (discounted) future reward you collect from a certain state and action onward.

$$ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} $$

If we express the relationship between $r$ and $G_t$ through the value function in Bellman equation form (a small numerical sketch follows right after this aside):

$$ \begin{align*} V(s) &= \mathbb{E}[G_t \mid s_t = s] && (\text{Definition of Value Function}) \\ &\downarrow \\ V(s) &= \mathbb{E}[r_t + \gamma V(s_{t+1}) \mid s_t = s] && (\text{Bellman Equation}) \end{align*} $$

</aside>
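To make the reward/return distinction concrete, here is a small numerical sketch (the rewards and discount factor are arbitrary). It also checks numerically that the Bellman-style decomposition $G_t = r_t + \gamma G_{t+1}$ holds.

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]     # r_t, r_{t+1}, r_{t+2}, r_{t+3} (made-up numbers)

def discounted_return(rews, gamma):
    """G_t = sum_k gamma^k * r_{t+k}"""
    return sum(gamma**k * r for k, r in enumerate(rews))

G_t  = discounted_return(rewards, gamma)        # return from time t
G_t1 = discounted_return(rewards[1:], gamma)    # return from time t+1
# The Bellman equation just peels off the first reward: G_t = r_t + gamma * G_{t+1}
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-9
print(G_t)   # ~3.349
```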

Objective Function

Fundamentally, every RL algorithm tries to maximize this single objective:

$$ \max_\pi J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] $$

This equation says: find the best policy ($\pi^*$) that maximizes ($\max$) the average ($\mathbb{E}$) of the accumulated ($\Sigma_t$) future discounted ($\gamma^t$) reward ($r_t$).

Although the exact expression is the average of the accumulated future reward, I will refer to it simply as “future reward” for convenience.
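One straightforward (if expensive) way to estimate $J(\pi)$ is Monte Carlo: roll out many episodes under $\pi$ and average the discounted returns. The sketch below again assumes a Gymnasium-style environment; the function and parameter names are illustrative only.

```python
def estimate_objective(env, policy, gamma=0.99, episodes=100, max_steps=1000):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^t r_t]."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward, terminated, truncated, _ = env.step(action)
            g += discount * reward        # accumulate gamma^t * r_t
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(g)
    return sum(returns) / len(returns)    # empirical average over episodes
```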

What do we need to solve an RL Problem?

As stated above, the core problem is figuring out how to estimate the future reward $G_t$. How do we actually estimate that infinite sum, then? We use value functions and the Bellman equation.
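As a preview of where this leads, here is a sketch of how the Bellman equation turns that infinite sum into a fixed-point computation: iterative policy evaluation on a tiny hand-made MDP (same dictionary layout as the earlier MDP sketch; all numbers are invented for illustration).

```python
P = {   # P[s][a] = [(probability, next_state), ...]
    "s0": {"stay": [(1.0, "s0")], "go": [(1.0, "s1")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
pi = {"s0": "go", "s1": "stay"}    # a fixed deterministic policy
gamma = 0.9

V = {s: 0.0 for s in P}            # initial guess for the value function
for _ in range(200):               # repeated Bellman backups converge to V_pi
    V = {
        s: R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
        for s in P
    }
print(V)   # approaches V(s1) = 2 / (1 - 0.9) = 20 and V(s0) = 1 + 0.9 * 20 = 19
```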