Caching the Key (K) and Value (V) states of generative transformers has been around for a while, but maybe you need to understand exactly what it is, and the great inference speedups that it provides.
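If you just want to see the effect before diving into the details, here is a minimal sketch that times generation with and without the cache. It assumes the Hugging Face transformers library and GPT-2, which are only illustrative choices, not part of the explanation above:

```python
# A minimal sketch: time generation with and without the KV cache.
# Model choice (GPT-2) and token counts are illustrative assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("KV caching is", return_tensors="pt")

def timed_generate(use_cache: bool) -> float:
    start = time.time()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=200,
            use_cache=use_cache,  # toggle the KV cache on or off
            pad_token_id=tokenizer.eos_token_id,
        )
    return time.time() - start

print(f"with KV cache:    {timed_generate(True):.2f} s")
print(f"without KV cache: {timed_generate(False):.2f} s")
```

On a typical setup the cached run finishes noticeably faster, and the gap grows with the number of generated tokens.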
The Key and Value states are used for calculating the scaled dot-product attention, as seen in the image below.

Scaled dot-product attention and where it is applied in the transformer architecture. (Image source: https://lilianweng.github.io/posts/2018-06-24-attention/#full-architecture)
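In code, that attention computation looks roughly like this. It is a minimal PyTorch sketch; the tensor names, shapes, and mask handling are illustrative:

```python
# A minimal sketch of scaled dot-product attention:
# softmax(Q @ K^T / sqrt(d_k)) @ V
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V  # (batch, seq_len, d_k)
```

Note that K and V appear in every attention call: this is exactly what makes them worth caching once they have been computed for earlier tokens.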
KV caching is used during the successive token-generation steps and happens only in the decoder (i.e., in decoder-only models like GPT, or in the decoder part of encoder-decoder models like T5). Encoder-only models like BERT are not generative and therefore do not use KV caching.
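To make this concrete, here is a minimal sketch of a greedy decoding loop that reuses the cached states between steps. It assumes Hugging Face transformers and GPT-2 (again only as an example); in that library the cache is exposed as past_key_values:

```python
# A minimal sketch of greedy decoding that reuses the cached K and V states.
# After the first step, only the newest token is fed to the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("KV caching is", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None  # the KV cache, empty before the first step

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # updated cache for every layer
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token  # only the new token is fed in the next step

print(tokenizer.decode(generated[0]))
```

Because the K and V states of all earlier positions come from the cache, each step only computes attention projections for the single new token instead of reprocessing the whole sequence.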