Model Based Reinforcement Learning


The methods covered in the previous lectures (chapters 4–9) all assumed that $p(s_{t+1}|s_t, a_t)$ is unknown and proceeded by sampling (estimating the distribution by running many rollouts).

What if we knew the transition dynamics (transition function)?

e.g., games, easily modeled systems, and simulated environments (where the dynamics are known), or settings where the dynamics can be obtained via system identification or learning.

Knowing the dynamics (often) makes things easier (no need to sample transitions, more stable optimization, etc.).

The Objective

Phrased as a minimization, we are minimizing the total cost:

$$ \min_{\mathbf{a}_1, \dots, \mathbf{a}_T} \sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t) \quad \text{s.t.} \ \mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{a}_{t-1}) $$

Here, the function $f$ is the transition function.
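
As a concrete sketch of this objective, the snippet below uses random shooting: sample many candidate action sequences, roll each one through the known $f$, and keep the cheapest. The dynamics `f`, cost `c`, and all parameter values here are hypothetical stand-ins for illustration, not anything defined in the lecture.

```python
import numpy as np

def f(s, a):
    """Known deterministic dynamics: a 1-D point mass (position, velocity)."""
    pos, vel = s
    return np.array([pos + vel, vel + a])

def c(s, a):
    """Cost: squared distance from the origin plus a small control penalty."""
    return s[0] ** 2 + 0.1 * a ** 2

def rollout_cost(s0, actions):
    """Total cost of an open-loop action sequence under the known f."""
    s, total = s0, 0.0
    for a in actions:
        total += c(s, a)
        s = f(s, a)
    return total

def random_shooting(s0, T=10, n_candidates=1000, seed=0):
    """Sample candidate action sequences; keep the lowest-cost one."""
    rng = np.random.default_rng(seed)
    best_actions, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=T)
        cost = rollout_cost(s0, actions)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

plan, cost = random_shooting(np.array([3.0, 0.0]))
print(f"best open-loop plan cost: {cost:.3f}")
```

Random shooting is only one simple way to approximately solve this optimization; the point is that with a known $f$, evaluating a candidate plan requires no interaction with the real environment.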

Deterministic Case

In the deterministic case, each action taken in a given state leads to exactly one next state with probability 1, as defined by the transition function.

$$ \mathbf{a}_1, \dots, \mathbf{a}_T = \arg \max_{\mathbf{a}_1, \dots, \mathbf{a}_T} \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \quad \text{s.t.} \ \mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{a}_t) $$

This has exactly the same meaning as the equation in The Objective above, just expressed differently (minimizing the cost == maximizing the reward).
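
To make the equivalence explicit, setting $c(\mathbf{s}_t, \mathbf{a}_t) = -r(\mathbf{s}_t, \mathbf{a}_t)$ gives:

$$ \arg \max_{\mathbf{a}_1, \dots, \mathbf{a}_T} \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) = \arg \min_{\mathbf{a}_1, \dots, \mathbf{a}_T} \sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t) $$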

Stochastic Open-loop Case