Photo by Kelsy Gagnebin on Unsplash

How caching Key and Value states makes transformers faster

João Lages

3 min read

Oct 8, 2023

Caching the Key (K) and Value (V) states of generative transformers has been around for a while, but you may not know exactly what it is, or the large inference speedups that it provides.

The Key and Value states are used for calculating the scaled dot-product attention, as seen in the image below.

Scaled dot-product attention and where it is applied in the transformer architecture. (Image source: https://lilianweng.github.io/posts/2018-06-24-attention/#full-architecture)
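For reference, here is a minimal PyTorch sketch of scaled dot-product attention. The tensor shapes are illustrative and causal masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: tensors of shape (batch, seq_len, d_k)
    d_k = q.size(-1)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / d_k**0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```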

KV caching happens across the successive token-generation steps and only in the decoder (i.e., in decoder-only models like GPT, or in the decoder part of encoder-decoder models like T5). Models like BERT are not generative and therefore do not have KV caching. A sketch of how the cache is reused during generation follows below.
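To make this concrete, here is a minimal sketch of a greedy decoding loop that reuses cached Key/Value states through the Hugging Face transformers API. The model name "gpt2", the prompt, and the 20-token generation length are placeholders; after the first forward pass, only the newly generated token needs to be fed to the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("KV caching makes generation", return_tensors="pt").input_ids
generated = input_ids

past_key_values = None  # the KV cache, filled on the first forward pass
for _ in range(20):
    out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values  # cached K/V states for every layer
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=-1)
    # thanks to the cache, only the new token is passed on the next step
    input_ids = next_token

print(tokenizer.decode(generated[0]))
```

Without the cache (i.e., passing the full sequence and `use_cache=False` on every step), the model would recompute the Key and Value projections for all previous tokens at each generation step.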