vLLM is essentially trying to use GPU memory better than existing LLM serving systems.

The paper explains techniques for several different decoding workloads, all of which matter in modern LLM applications (e.g., beam search is important for AI code assistants).

If you remember the last post I’ve written:

While there are several techniques, such as shared prefix, that improve prefill throughput, the core contribution of vLLM and PagedAttention is how they tackle the memory-bound decoding problem.

It’s important to differentiate logical KV blocks from physical KV blocks.
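Logical blocks are contiguous from the sequence’s point of view, while the physical blocks backing them can live anywhere in GPU memory; a per-sequence block table does the translation. Here is a minimal sketch of that idea, assuming a block size of 16 tokens; the class and method names are my own for illustration, not vLLM’s actual implementation:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockTable:
    """Maps a sequence's logical KV blocks to physical KV block ids."""

    def __init__(self, free_physical_blocks: list[int]):
        self.free = free_physical_blocks   # pool of free physical block ids
        self.table: list[int] = []         # index = logical block, value = physical block id

    def append_token(self, num_tokens_so_far: int) -> None:
        # A new physical block is allocated only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())

    def physical_location(self, token_idx: int) -> tuple[int, int]:
        # Translate a token's logical position into (physical block id, offset).
        return self.table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE


# A 20-token sequence occupies two physical blocks that need not be adjacent.
bt = BlockTable(free_physical_blocks=[7, 3, 11])
for i in range(20):
    bt.append_token(i)
print(bt.table)                  # [11, 3]
print(bt.physical_location(19))  # (3, 3): second logical block, offset 3
```

The point is that the logical view stays contiguous while physical blocks are allocated on demand, so memory is reserved per block rather than for the maximum possible sequence length up front.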

$\text{Total VRAM} = \text{Model Weights} + \text{Activation Overhead} + \text{KV Cache}$
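To get a feel for why the KV cache term matters, here is a back-of-the-envelope calculation using the OPT-13B figures cited in the vLLM paper (40 layers, hidden size 5120, FP16); the variable names are just for illustration:

```python
# KV-cache sizing for OPT-13B, as in the vLLM paper.
num_layers  = 40
hidden_size = 5120
bytes_fp16  = 2
kv_vectors  = 2        # one key vector and one value vector per layer

kv_bytes_per_token = kv_vectors * hidden_size * num_layers * bytes_fp16
print(kv_bytes_per_token)                  # 819200 bytes ≈ 800 KB per token

# A full 2048-token sequence therefore holds roughly 1.6 GB of KV cache.
print(kv_bytes_per_token * 2048 / 2**30)   # ≈ 1.56 GiB
```

Model weights are a fixed cost, but the KV cache grows with batch size and sequence length, so it is the term that determines how many requests fit on the GPU at once.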