Recap of Actor-Critic

We're starting from actor-critic, which has three steps:

  1. Generate samples - run your policy in the environment
  2. Fit a model - learn a value function $V^\pi$ to estimate returns
  3. Improve the policy - use policy gradients with the advantage function $A^\pi$

The key insight from actor-critic: we fit $V^\pi_\phi(s)$ to estimate future rewards, then compute advantages $A^\pi(s,a) = r(s,a) + \gamma V^\pi(s') - V^\pi(s)$. This tells us how much better action $a$ is than the average action sampled from $\pi$ in state $s$.
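
To make the recap concrete, here's a minimal sketch of that one-step advantage computation, assuming a fitted critic `value_fn` that maps a batch of states to value estimates (the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def compute_advantages(rewards, states, next_states, dones, value_fn, gamma=0.99):
    """One-step advantage estimates: A(s, a) = r(s, a) + gamma * V(s') - V(s)."""
    v_s = value_fn(states)           # critic's estimate of V^pi(s)
    v_next = value_fn(next_states)   # critic's estimate of V^pi(s')
    # Drop the bootstrap term on terminal transitions.
    targets = rewards + gamma * (1.0 - dones) * v_next
    return targets - v_s

# Tiny usage example with a dummy critic that values every state at 0.5.
if __name__ == "__main__":
    dummy_value_fn = lambda s: np.full(len(s), 0.5)
    rewards = np.array([1.0, 0.0])
    dones = np.array([0.0, 1.0])
    states = np.zeros((2, 4))        # placeholder state features
    next_states = np.ones((2, 4))
    print(compute_advantages(rewards, states, next_states, dones, dummy_value_fn))
```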

Slides 3-4: The Big Idea - Can We Skip Policy Gradients?

Here's where things get interesting. Consider what $\arg\max_a A^\pi(s,a)$ tells us: it's the best action to take in state $s$, assuming we follow policy $\pi$ afterward. Taking this action is at least as good (in expectation) as sampling an action from $\pi$, regardless of what $\pi$ actually is.

The insight: Why bother with an explicit policy at all?

If we have $A^\pi$ or $Q^\pi$, we can just define our policy as:

$$ \pi'(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} A^\pi(s,a') \\ 0 & \text{otherwise} \end{cases} $$

This is a deterministic policy that just picks the best action according to our value function. This policy is as good as or better than $\pi$, regardless of what $\pi$ is!
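
As a minimal sketch, assuming a discrete action space and a hypothetical `advantage_fn(state, action)` that queries the critic, the implicit policy is just an argmax:

```python
import numpy as np

def greedy_policy(advantage_fn, state, num_actions):
    """pi'(a|s): probability 1 on argmax_a A^pi(s, a), 0 on every other action."""
    scores = np.array([advantage_fn(state, a) for a in range(num_actions)])
    probs = np.zeros(num_actions)
    probs[np.argmax(scores)] = 1.0   # all probability mass on the best action
    return probs
```

Note that using $Q^\pi$ in place of $A^\pi$ here selects the same action, since $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ and the $V^\pi(s)$ term doesn't depend on $a$.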