Differences from V0

This section lists some differences in behavior between V0 and V1.

Chunked Prefill

Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.
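If you want V0-like behavior, chunked prefill can still be toggled explicitly. A minimal sketch using the offline API, assuming the enable_chunked_prefill engine argument mirrors the corresponding CLI flag:

```python
from vllm import LLM

# enable_chunked_prefill is assumed to mirror the engine's chunked-prefill
# setting; in V1 it is on by default whenever possible, so pass False to opt out.
llm = LLM(model="facebook/opt-125m", enable_chunked_prefill=False)
```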

CUDA Graphs

CUDA graph capture takes up more memory in V1 than in V0.

Semantic Changes to Logprobs

Logprobs Calculation

By default, logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e., before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.

You can adjust this behavior by setting the --logprobs-mode flag. Four modes are supported: raw_logprobs (default), processed_logprobs, raw_logits, and processed_logits. Raw means the values before applying any logit processors, such as bad-words filtering. Processed means the values after applying all processors, including temperature and top_k/top_p.
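As a rough sketch, assuming the offline API exposes the flag as an equivalent logprobs_mode engine argument, you could request processed logprobs like this:

```python
from vllm import LLM, SamplingParams

# logprobs_mode is assumed to mirror the --logprobs-mode CLI flag;
# "processed_logprobs" returns values after temperature, penalties, top_k/top_p, etc.
llm = LLM(model="facebook/opt-125m", logprobs_mode="processed_logprobs")

# Ask for the top-5 logprobs per generated token.
params = SamplingParams(temperature=0.8, logprobs=5)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].logprobs)
```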

Prompt Logprobs with Prefix Caching

While V1 supports requesting prompt logprobs with prefix caching enabled, it no longer caches the logprobs. For a request that requires prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate them.
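For illustration, a request that asks for prompt logprobs via SamplingParams therefore pays the full prefill cost even when its prefix is already cached; a minimal sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# prompt_logprobs requests the top-N logprobs for each prompt token.
# In V1 this forces a full prefill recompute for this request, even if
# the prompt shares a cached prefix with earlier requests.
params = SamplingParams(max_tokens=16, prompt_logprobs=3)
out = llm.generate(["The quick brown fox jumps over the lazy dog"], params)
print(out[0].prompt_logprobs)
```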

Feature Support

For each item, its support in vLLM V1 falls into one of the following states:

Note

vLLM V1’s unified scheduler treats prompt and output tokens the same way, using a simple dictionary (e.g., {request_id: num_tokens}) to dynamically allocate a fixed token budget per request. This enables features like chunked prefills, prefix caching, and speculative decoding without a strict separation between prefill and decode phases.
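The dictionary-based token budget can be illustrated with a small, self-contained sketch (not the actual V1 scheduler code): each step, the scheduler hands out tokens from a fixed budget regardless of whether a request is still prefilling or already decoding.

```python
# Illustrative sketch of a dictionary-based token budget, not vLLM's scheduler code.
def schedule_step(waiting: dict[str, int], token_budget: int) -> dict[str, int]:
    """Allocate up to token_budget tokens across requests.

    waiting maps request_id -> tokens still needed (prompt or output alike).
    Returns request_id -> tokens scheduled this step; a prefill that does not
    fit in the remaining budget is chunked rather than postponed.
    """
    scheduled: dict[str, int] = {}
    remaining = token_budget
    for request_id, num_tokens in waiting.items():
        if remaining == 0:
            break
        take = min(num_tokens, remaining)  # chunk long prefills to fit the budget
        scheduled[request_id] = take
        remaining -= take
    return scheduled

# Two decode requests (1 token each) and one long prefill share the same budget;
# the prefill is chunked to 2046 tokens instead of blocking the step.
print(schedule_step({"req-2": 1, "req-3": 1, "req-1": 8192}, token_budget=2048))
```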

The V1 scheduler supports multiple scheduling policies, including First-Come, First-Served (FCFS) and priority-based scheduling (where requests are processed based on assigned priority, with FCFS as a tie-breaker), configurable via the --scheduling-policy argument.
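The priority policy's ordering can be thought of as sorting by (priority, arrival time). A hedged sketch, assuming lower priority values are served first and arrival order breaks ties:

```python
from dataclasses import dataclass, field
from itertools import count

_arrival = count()

@dataclass
class Request:
    request_id: str
    priority: int = 0  # assumed convention: lower value is served sooner
    arrival: int = field(default_factory=lambda: next(_arrival))

def priority_order(requests: list[Request]) -> list[Request]:
    # Sort by priority first, then by arrival time as the FCFS tie-breaker.
    return sorted(requests, key=lambda r: (r.priority, r.arrival))

reqs = [Request("a", priority=1), Request("b", priority=0), Request("c", priority=0)]
print([r.request_id for r in priority_order(reqs)])  # ['b', 'c', 'a']
```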