Differences from V0

This section lists some differences in behavior between V0 and V1.

Chunked Prefill

Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.
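If you want V0-like behavior, chunked prefill can still be toggled explicitly. A minimal sketch using the offline API, assuming the enable_chunked_prefill engine argument mirrors the corresponding CLI flag:

```python
from vllm import LLM

# enable_chunked_prefill is assumed to mirror the engine's chunked-prefill
# setting; in V1 it is on by default whenever possible, so pass False to opt out.
llm = LLM(model="facebook/opt-125m", enable_chunked_prefill=False)
```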

CUDA Graphs

CUDA graph capture takes up more memory in V1 than in V0.

Semantic Changes to Logprobs

Logprobs Calculation

By default, logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e., before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.

You can adjust this behavior by setting the --logprobs-mode flag. Four modes are supported: raw_logprobs (default), processed_logprobs, raw_logits, and processed_logits. Raw means the values before applying any logit processors, such as bad-words filtering. Processed means the values after applying all processors, including temperature and top_k/top_p.
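As a rough sketch, assuming the offline API exposes the flag as an equivalent logprobs_mode engine argument, you could request processed logprobs like this:

```python
from vllm import LLM, SamplingParams

# logprobs_mode is assumed to mirror the --logprobs-mode CLI flag;
# "processed_logprobs" returns values after temperature, penalties, top_k/top_p, etc.
llm = LLM(model="facebook/opt-125m", logprobs_mode="processed_logprobs")

# Ask for the top-5 logprobs per generated token.
params = SamplingParams(temperature=0.8, logprobs=5)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].logprobs)
```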

Prompt Logprobs with Prefix Caching

While V1 supports requesting prompt logprobs with prefix caching enabled, it no longer caches the logprobs. For a request that requires prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate them.
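For illustration, a request that asks for prompt logprobs via SamplingParams therefore pays the full prefill cost even when its prefix is already cached; a minimal sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# prompt_logprobs requests the top-N logprobs for each prompt token.
# In V1 this forces a full prefill recompute for this request, even if
# the prompt shares a cached prefix with earlier requests.
params = SamplingParams(max_tokens=16, prompt_logprobs=3)
out = llm.generate(["The quick brown fox jumps over the lazy dog"], params)
print(out[0].prompt_logprobs)
```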

Feature Support

For each item, its support in vLLM V1 falls into one of the following states:

Note

vLLM V1’s unified scheduler treats prompt and output tokens the same way, using a simple dictionary (e.g., {request_id: num_tokens}) to dynamically allocate a fixed token budget per request. This enables features like chunked prefills, prefix caching, and speculative decoding without a strict separation between prefill and decode phases.
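The dictionary-based token budget can be illustrated with a small, self-contained sketch (not the actual V1 scheduler code): each step, the scheduler hands out tokens from a fixed budget regardless of whether a request is still prefilling or already decoding.

```python
# Illustrative sketch of a dictionary-based token budget, not vLLM's scheduler code.
def schedule_step(waiting: dict[str, int], token_budget: int) -> dict[str, int]:
    """Allocate up to token_budget tokens across requests.

    waiting maps request_id -> tokens still needed (prompt or output alike).
    Returns request_id -> tokens scheduled this step; a prefill that does not
    fit in the remaining budget is chunked rather than postponed.
    """
    scheduled: dict[str, int] = {}
    remaining = token_budget
    for request_id, num_tokens in waiting.items():
        if remaining == 0:
            break
        take = min(num_tokens, remaining)  # chunk long prefills to fit the budget
        scheduled[request_id] = take
        remaining -= take
    return scheduled

# Two decode requests (1 token each) and one long prefill share the same budget;
# the prefill is chunked to 2046 tokens instead of blocking the step.
print(schedule_step({"req-2": 1, "req-3": 1, "req-1": 8192}, token_budget=2048))
```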

The V1 scheduler supports multiple scheduling policies, including First-Come, First-Served (FCFS) and priority-based scheduling (where requests are processed based on assigned priority, with FCFS as a tie-breaker), configurable via the --scheduling-policy argument.
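The priority policy's ordering can be thought of as sorting by (priority, arrival time). A hedged sketch, assuming lower priority values are served first and arrival order breaks ties:

```python
from dataclasses import dataclass, field
from itertools import count

_arrival = count()

@dataclass
class Request:
    request_id: str
    priority: int = 0  # assumed convention: lower value is served sooner
    arrival: int = field(default_factory=lambda: next(_arrival))

def priority_order(requests: list[Request]) -> list[Request]:
    # Sort by priority first, then by arrival time as the FCFS tie-breaker.
    return sorted(requests, key=lambda r: (r.priority, r.arrival))

reqs = [Request("a", priority=1), Request("b", priority=0), Request("c", priority=0)]
print([r.request_id for r in priority_order(reqs)])  # ['b', 'c', 'a']
```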