The distributional shift problem

Terminology
- Training Trajectory: The path demonstrated by the expert (e.g., a human driver), which is used as the training data.
- $\pi_\theta$ Expected Trajectory: The path the learned policy actually takes.
- $\theta$: The learned parameters of the policy.
- $p_{\text{data}}(o_t)$: This is the distribution of observations seen by the expert during training. The agent is trained to perform well on this data.
- $p_{π_θ}(o_t)$: This is the distribution of observations the agent sees when it's actually performing the task.
- $\epsilon$ (read as “epsilon”): The error rate under the training distribution $p_{\text{data}}$ (not under the learner’s actual states!). Intuition: think of $\epsilon$ as a small positive number (for example, $0.01$). One common way to write this assumption is shown right below.
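In symbols (writing $a^{*}_t$ for the expert's action, a shorthand not used above):

$$
\pi_\theta(a_t \neq a^{*}_t \mid o_t) \le \epsilon \quad \text{whenever } o_t \sim p_{\text{data}}(o_t)
$$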
So what is the distributional shift problem?
The problem is that these two distributions are not the same: $p_{\text{data}}(o_t) \neq p_{\pi_\theta}(o_t)$. The agent is tested in situations it was never trained on.
For example, the image above shows how even a tiny error by the agent can send it into a situation the expert never encountered (i.e., never observed). In this new, unfamiliar state, the agent doesn't know what to do, so it likely makes even bigger mistakes, drifts further from familiar states, and so on. This compounding of errors is what we call distributional shift.
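To make the compounding concrete, here is a minimal toy simulation (a sketch, not from the original material: the corridor setup, `EPSILON`, `HORIZON`, and the 0.1 recovery chance are all illustrative assumptions). A cloned policy that errs with small probability $\epsilon$ on familiar states ends up making far more than $\epsilon T$ mistakes per rollout, because a single slip puts it into states it was never trained on:

```python
import random

# Toy illustration (hypothetical, not from the original material): the expert
# walks straight along a corridor. The cloned policy copies the expert with a
# small per-step error rate EPSILON, but a single slip puts it in states the
# expert never visited, where it can only guess until it stumbles back.

EPSILON = 0.02      # mistake probability on states seen during training
HORIZON = 100       # T, the number of steps per rollout
N_ROLLOUTS = 2000

def rollout(rng):
    on_corridor = True   # "in distribution" as long as we have never slipped
    mistakes = 0
    for _ in range(HORIZON):
        if on_corridor:
            # Familiar state: mimic the expert, except with probability EPSILON.
            if rng.random() < EPSILON:
                mistakes += 1
                on_corridor = False   # the slip leads to an unfamiliar state
        else:
            # Unfamiliar state: the policy was never trained here, so every
            # step counts as a mistake; assume it only finds its way back onto
            # the corridor by luck, with probability 0.1 (an arbitrary choice).
            mistakes += 1
            if rng.random() < 0.1:
                on_corridor = True
    return mistakes

rng = random.Random(0)
avg = sum(rollout(rng) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS
print(f"average mistakes per rollout: {avg:.1f}")
print(f"naive guess (epsilon * T):    {EPSILON * HORIZON:.1f}")
```

The printed average is much larger than $\epsilon T$; this is the intuition behind the well-known result that naive behavior cloning can accumulate total cost on the order of $\epsilon T^2$ rather than $\epsilon T$.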
OK, let's define more precisely what we want, then.
While the training process focuses on mimicking the expert's actions on the training data (supervised learning), what we really want is for the agent to perform the task well in the real world.
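Concretely, supervised training only optimizes how well the policy matches the expert on the expert's own observations; a standard way to write the behavior-cloning objective (assuming a probabilistic policy $\pi_\theta(a_t \mid o_t)$) is:

$$
\max_\theta \; \mathbb{E}_{(o_t, a_t) \sim p_{\text{data}}}\big[\log \pi_\theta(a_t \mid o_t)\big]
$$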
To measure this, a cost function $c(s, a)$ is introduced:
- Cost = 0: If the agent's action matches the expert's action in a given state.
- Cost = 1: If the agent makes a mistake (i.e., its action differs from the expert's).
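Written out as a 0-1 loss (using $\pi^{*}$ as shorthand for the expert policy, a symbol not introduced above):

$$
c(s_t, a_t) =
\begin{cases}
0 & \text{if } a_t = \pi^{*}(s_t) \\
1 & \text{otherwise}
\end{cases}
$$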
The true goal is to minimize the total cost (number of mistakes) under the agent's own distribution of states, $p_{\pi_\theta}$, not under the expert's training data distribution.
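In equation form (with $T$ denoting the horizon length):

$$
\min_\theta \; \sum_{t=1}^{T} \mathbb{E}_{s_t \sim p_{\pi_\theta}(s_t)}\big[c(s_t, a_t)\big]
$$

The key detail is the expectation under $p_{\pi_\theta}$: the agent is judged on the states it actually visits, which is exactly where training gives no guarantees.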