What Exactly Is vLLM? (via PagedAttention)

The Limitations of Transformers

Unlike workloads with static input/output sizes, LLM token generation with Transformers becomes highly inefficient as the sequence grows longer.

For every new token generated, one key vector and one value vector are appended to the KV cache.

Transformers store this KV cache in one giant contiguous block of memory.
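
To make this growth pattern concrete, here is a minimal PyTorch sketch of the naive approach (illustrative only, not any particular library's implementation): every generation step appends one key and one value vector, and the whole cache is kept as a single contiguous tensor.

```python
import torch

num_heads, head_dim = 8, 64

# One contiguous KV cache per request: shape [seq_len, num_heads, head_dim].
k_cache = torch.empty(0, num_heads, head_dim)
v_cache = torch.empty(0, num_heads, head_dim)

def append_kv(k_cache, v_cache, k_new, v_new):
    # Each new token contributes exactly one key and one value vector.
    # torch.cat copies everything into a fresh contiguous tensor, so the
    # cache must always fit in one unbroken region of memory.
    k_cache = torch.cat([k_cache, k_new.unsqueeze(0)], dim=0)
    v_cache = torch.cat([v_cache, v_new.unsqueeze(0)], dim=0)
    return k_cache, v_cache

for _ in range(16):  # pretend we generate 16 tokens
    k_new = torch.randn(num_heads, head_dim)
    v_new = torch.randn(num_heads, head_dim)
    k_cache, v_cache = append_kv(k_cache, v_cache, k_new, v_new)

print(k_cache.shape)  # torch.Size([16, 8, 64])
```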

This introduces two major problems:

  1. Over-reservation

    When inference starts, the model does not know how long the final output will be (sequence length is dynamic), and once GPU memory is exhausted mid-generation, the request simply fails with an out-of-memory error.

    To avoid this, the system reserves memory for the maximum possible sequence length up front. This is hugely inefficient: most sequences are much shorter than the maximum, so most of the reserved slots sit idle, taking up space without ever being used (see the back-of-the-envelope sketch after this list).

  2. Fragmentation

    Each time memory is allocated and freed, the available space becomes fragmented. This is especially bad when many users' requests are served simultaneously: as requests finish, their memory is released, but because the freed chunks all have different sizes, the resulting gaps are often too small to fit a new request, so they sit empty and wasted.
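
A quick back-of-the-envelope sketch of the over-reservation waste (the model dimensions, maximum length, and actual length below are assumed for illustration, not taken from the paper):

```python
# Assumed, illustrative numbers: a 13B-class model with 40 layers and
# hidden size 5120, storing K and V in fp16 (2 bytes per value).
bytes_per_token = 2 * 40 * 5120 * 2           # K and V across all layers (~0.8 MB)

max_seq_len   = 2048                          # slots reserved per request
actual_length = 200                           # tokens actually generated

reserved = max_seq_len * bytes_per_token / 2**20    # MiB
used     = actual_length * bytes_per_token / 2**20  # MiB

print(f"reserved: {reserved:.0f} MiB, used: {used:.0f} MiB, "
      f"wasted: {100 * (1 - used / reserved):.0f}%")
# reserved: 1600 MiB, used: 156 MiB, wasted: 90%
```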

The solution the vLLM authors presented is PagedAttention,

and vLLM is a serving engine that makes this variant of attention easy to use.
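
For context, a minimal offline-inference snippet with vLLM's Python API might look like the following (the model name is only an example, and the exact API surface can differ between vLLM versions):

```python
from vllm import LLM, SamplingParams

# PagedAttention and the block-based KV cache are handled internally;
# from the user's side this looks like an ordinary generate() call.
llm = LLM(model="facebook/opt-125m")  # example model; swap in your own
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is PagedAttention?"], sampling_params)
print(outputs[0].outputs[0].text)
```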

The goal of PagedAttention is to store key-value tensors efficiently in non-contiguous GPU VRAM.
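
Here is a simplified sketch of that idea in Python (the names `BlockTable`, `write_kv`, and the pool layout are invented for illustration and are not vLLM's actual implementation): the KV cache is split into fixed-size blocks drawn from a shared pool, and a per-sequence block table maps logical token positions to physical blocks that can live anywhere in memory.

```python
import torch

BLOCK_SIZE = 16                     # tokens per KV block (fixed size)
NUM_BLOCKS = 256                    # physical blocks in the shared pool
num_heads, head_dim = 8, 64

# A shared pool of physical KV blocks; a sequence's blocks can sit anywhere
# in this pool, so its cache no longer needs to be contiguous.
k_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, num_heads, head_dim)
v_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, num_heads, head_dim)
free_blocks = list(range(NUM_BLOCKS))

class BlockTable:
    """Maps a sequence's logical block index to a physical block index."""
    def __init__(self):
        self.physical_blocks = []

    def slot_for(self, token_pos):
        logical_block, offset = divmod(token_pos, BLOCK_SIZE)
        # Allocate a new physical block only when the previous one is full.
        if logical_block == len(self.physical_blocks):
            self.physical_blocks.append(free_blocks.pop())
        return self.physical_blocks[logical_block], offset

def write_kv(table, token_pos, k_new, v_new):
    block, offset = table.slot_for(token_pos)
    k_pool[block, offset] = k_new
    v_pool[block, offset] = v_new

table = BlockTable()
for pos in range(40):               # 40 tokens fill 2 blocks plus part of a 3rd
    write_kv(table, pos,
             torch.randn(num_heads, head_dim),
             torch.randn(num_heads, head_dim))

print(table.physical_blocks)        # [255, 254, 253] -- non-contiguous is fine
```

Because at most the last block of each sequence is partially empty, and all blocks have the same size so any freed block can be reused by any request, both over-reservation and fragmentation are largely eliminated.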

The authors of the paper were inspired by the Virtual Memory Paging technique from Operating Systems.