We're starting from actor-critic, which has three steps: fit $V^\pi_\phi(s)$ to estimate future rewards, compute advantages $A^\pi(s,a) = r(s,a) + \gamma V^\pi(s') - V^\pi(s)$, and use those advantages to take a policy gradient step. The advantage tells us how much better action $a$ is than the average action under $\pi$ (a code sketch follows the definitions below).
$s$: current state
$s'$: next state
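To make those three steps concrete, here is a minimal numpy sketch on a made-up batch of transitions, with a linear critic fit to one-step TD targets; the data, the linear value function, and the fitting loop are illustrative assumptions rather than a full implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Made-up batch of transitions (s, r, s'): 4-dim states, scalar rewards.
states      = rng.normal(size=(32, 4))
rewards     = rng.normal(size=32)
next_states = rng.normal(size=(32, 4))

# Step 1: fit V^pi_phi(s) to estimate future rewards.
# Here the critic is linear, V(s) = s @ w, refit by least squares to the
# one-step TD targets r + gamma * V(s') computed with the current weights.
w = np.zeros(4)
for _ in range(50):
    targets = rewards + gamma * next_states @ w
    w, *_ = np.linalg.lstsq(states, targets, rcond=None)

# Step 2: compute advantages A^pi(s, a) = r(s, a) + gamma * V(s') - V(s).
advantages = rewards + gamma * next_states @ w - states @ w

# Step 3 would use these advantages to weight the policy gradient,
# i.e. average grad log pi_theta(a|s) * A^pi(s, a) over the batch.
print(advantages[:5])
```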
Actor Critic, Q-Learning Summary
Recap on $Q$ and $V$: $Q^\pi(s,a) = r(s,a) + \gamma E_{s'}[V^\pi(s')]$ is the expected total reward from taking action $a$ in state $s$ and then following $\pi$, $V^\pi(s) = E_{a\sim\pi(a|s)}[Q^\pi(s,a)]$ is the expected total reward from $s$ under $\pi$, and $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ measures how much better $a$ is than $\pi$'s average action.
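As a quick numerical check of these relationships (with made-up $Q$-values and a made-up stochastic policy for one state, purely for illustration): $V$ is the policy-weighted average of $Q$, the advantages average to zero under $\pi$, and $\arg\max_a A^\pi(s,a)$ picks the same action as $\arg\max_a Q^\pi(s,a)$.

```python
import numpy as np

# Made-up Q-values for the 3 available actions in some state s, under policy pi.
q_sa = np.array([1.0, 2.5, 0.5])          # Q^pi(s, a) for each action a
pi_a_given_s = np.array([0.2, 0.5, 0.3])  # pi(a|s), a stochastic policy

v_s = np.dot(pi_a_given_s, q_sa)  # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
adv = q_sa - v_s                  # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(v_s)                                # 1.6
print(adv)                                # [-0.6  0.9 -1.1]
print(np.dot(pi_a_given_s, adv))          # 0.0: advantages average to zero under pi
print(np.argmax(adv) == np.argmax(q_sa))  # True: same best action either way
```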
Here's where things get interesting. Look at what $\arg\max_a A^\pi(s,a)$ tells us: it's the best action from state $s$ if we follow policy $\pi$ afterward. This is at least as good as any action sampled from $\pi$, regardless of what $\pi$ actually is.
The insight: Why bother with an explicit policy at all?
If we have $A^\pi$ or $Q^\pi$, we can just define our policy as:
$$ \pi'(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_a A^\pi(s,a) \\ 0 & \text{otherwise} \end{cases} $$
This is a deterministic policy that just picks the best action according to our value function. This policy is as good as or better than $\pi$, regardless of what $\pi$ is!
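A small tabular sketch of this implicit policy (made-up $Q^\pi$ values for two states and three actions): $\pi'$ is a one-hot argmax over each row, and since $V^\pi(s)$ does not depend on $a$, the argmax over $A^\pi$ is the same as the argmax over $Q^\pi$; its one-step value $\max_a Q^\pi(s,a)$ is never below the expected value $E_{a\sim\pi}[Q^\pi(s,a)]$ under the original stochastic policy.

```python
import numpy as np

# Made-up tabular Q^pi for a tiny MDP: 2 states x 3 actions.
Q = np.array([[1.0, 2.5, 0.5],
              [0.3, 0.1, 0.9]])
pi = np.array([[0.2, 0.5, 0.3],   # stochastic policy pi(a|s), one row per state
               [0.4, 0.4, 0.2]])

# pi'(a|s) = 1 if a = argmax_a A^pi(s, a), else 0.
# A^pi(s, a) = Q^pi(s, a) - V^pi(s), and V^pi(s) is constant in a,
# so the argmax over A^pi equals the argmax over Q^pi.
greedy_actions = np.argmax(Q, axis=1)           # best action in each state
pi_prime = np.eye(Q.shape[1])[greedy_actions]   # one-hot rows: pi'(a|s)

# One-step comparison per state: value of the greedy choice vs. the
# expected value of sampling an action from pi.
value_under_pi_prime = Q[np.arange(Q.shape[0]), greedy_actions]  # max_a Q^pi(s, a)
value_under_pi = (pi * Q).sum(axis=1)                            # E_{a~pi}[Q^pi(s, a)]
print(pi_prime)
print(value_under_pi_prime >= value_under_pi)   # [ True  True ]
```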