RL in a Nutshell

In one sentence, Reinforcement Learning (or RL) is finding the optimal sequence of choices under uncertainty when the reward is delayed.

Unlike Supervised Learning, where labels exist, RL is about learning through interaction and experience:

$$ \text{State } s_0 \to \text{Action } a_0 \to \text{Reward } r_0 \to \text{State } s_1 \to \dots $$
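Conceptually, that chain is produced by an interaction loop. Below is a minimal sketch of such a loop, assuming a Gymnasium-style `reset()`/`step()` interface; `env` and `policy` are placeholder names, not part of any specific library discussed here.

```python
def collect_trajectory(env, policy, max_steps=100):
    """Roll out one episode and record the chain (s_t, a_t, r_t)."""
    trajectory = []
    state, _ = env.reset()                                  # State s_0
    for _ in range(max_steps):
        action = policy(state)                              # Action a_t
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))          # (s_t, a_t, r_t)
        state = next_state                                  # move on to State s_{t+1}
        if terminated or truncated:
            break
    return trajectory
```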

The Problem Setup

The definition above is very abstract. When actually solving an RL problem mathematically, we model the world as a Markov Decision Process (MDP for short). An MDP assumes the Markov Property, which means the future depends only on the current state, not on the history of how we got there.
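To make the MDP idea concrete, here is a toy MDP sketched as plain Python dictionaries; every state, action, probability, and reward below is invented for illustration. The Markov property is visible in the structure: sampling the next state needs only the current state and action, never the history.

```python
import random

P = {   # P[s][a] = [(probability, next_state), ...]
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {   # R[s][a] = immediate reward r(s, a)
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
gamma = 0.9   # discount factor

def step(s, a):
    """One environment step: depends only on (s, a), not on how we reached s."""
    probs, next_states = zip(*P[s][a])
    return random.choices(next_states, weights=probs)[0], R[s][a]
```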

While we assume we know the immediate reward $r$, the core difficulty of RL is estimating the return $G_t$.

<aside> 💡

How does the word “reward” differ from “return” in RL?

Reward $r$ is a local, immediate value you get from a certain state and action. It’s often expressed as $r(s,a)$. It is mainly what appears when we write the objective function in Bellman equation form.

The return $G_t$, on the other hand, is the global quantity: the accumulated (discounted) future reward you collect from a certain state and action onward.

$$ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} $$

If we express the relationship between $r$ and $G_t$ through the value function in Bellman equation form (a small numerical sketch follows right after this aside):

$$ \begin{align*} V(s) &= \mathbb{E}[G_t \mid s_t = s] && (\text{Definition of Value Function}) \\ &\downarrow \\ V(s) &= \mathbb{E}[r_t + \gamma V(s_{t+1}) \mid s_t = s] && (\text{Bellman Equation}) \end{align*} $$

</aside>
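To make the reward/return distinction concrete, here is a small numerical sketch (the rewards and discount factor are arbitrary). It also checks numerically that the Bellman-style decomposition $G_t = r_t + \gamma G_{t+1}$ holds.

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]     # r_t, r_{t+1}, r_{t+2}, r_{t+3} (made-up numbers)

def discounted_return(rews, gamma):
    """G_t = sum_k gamma^k * r_{t+k}"""
    return sum(gamma**k * r for k, r in enumerate(rews))

G_t  = discounted_return(rewards, gamma)        # return from time t
G_t1 = discounted_return(rewards[1:], gamma)    # return from time t+1
# The Bellman equation just peels off the first reward: G_t = r_t + gamma * G_{t+1}
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-9
print(G_t)   # ~3.349
```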

Objective Function

Fundamentally, every RL algorithm tries to maximize this single objective:

$$ \max_\pi J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] $$

This equation says: find the best policy ($\pi^*$) that maximizes ($\max$) the average ($\mathbb{E}$) of the accumulated ($\Sigma_t$) future discounted ($\gamma^t$) reward ($r_t$).

Although the exact expression is the average of the accumulated future reward, I will refer to it simply as “future reward” for convenience.
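One straightforward (if expensive) way to estimate $J(\pi)$ is Monte Carlo: roll out many episodes under $\pi$ and average the discounted returns. The sketch below again assumes a Gymnasium-style environment; the function and parameter names are illustrative only.

```python
def estimate_objective(env, policy, gamma=0.99, episodes=100, max_steps=1000):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^t r_t]."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward, terminated, truncated, _ = env.step(action)
            g += discount * reward        # accumulate gamma^t * r_t
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(g)
    return sum(returns) / len(returns)    # empirical average over episodes
```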

What do we need to solve an RL Problem?

As stated above, the core problem is figuring out how to estimate the future reward $G_t$. How do we actually estimate that infinite sum, then? We use value functions and the Bellman equation.
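As a preview of where this leads, here is a sketch of how the Bellman equation turns that infinite sum into a fixed-point computation: iterative policy evaluation on a tiny hand-made MDP (same dictionary layout as the earlier MDP sketch; all numbers are invented for illustration).

```python
P = {   # P[s][a] = [(probability, next_state), ...]
    "s0": {"stay": [(1.0, "s0")], "go": [(1.0, "s1")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
pi = {"s0": "go", "s1": "stay"}    # a fixed deterministic policy
gamma = 0.9

V = {s: 0.0 for s in P}            # initial guess for the value function
for _ in range(200):               # repeated Bellman backups converge to V_pi
    V = {
        s: R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
        for s in P
    }
print(V)   # approaches V(s1) = 2 / (1 - 0.9) = 20 and V(s0) = 1 + 0.9 * 20 = 19
```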