Key-Value (KV) caching is a technique used in large language model (LLM) inference to store the key and value tensors from previous decoding steps. By reusing these stored tensors for each new token’s attention computation, KV caching avoids redundant calculations and significantly accelerates autoregressive generation. This review covers recent theoretical advancements in KV caching (2024–2025), practical integration strategies in model architectures, real-world enterprise use cases, comparisons with alternative optimizations, framework-specific implementations (PyTorch, TensorFlow, JAX), and performance benchmarks from cutting-edge LLMs like Mistral, DeepSeek, and OpenAI’s latest models.

Theoretical Advancements in KV Caching

Avoiding Recomputing Attention: In a transformer decoder, generating each new token involves computing self-attention over all prior tokens. KV caching mitigates this by storing the past keys and values so that each decoding step only computes attention for the newest token’s query. This yields a dramatic speedup: after the prompt is processed in a single full-attention pass (the prefill), each subsequent step reuses the cached KV pairs and appends just one new key/value pair per layer and head. The result is near-constant per-token latency during decoding, greatly improving throughput for long sequences.
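As a concrete illustration, here is a minimal single-layer, single-head sketch of this pattern in PyTorch. The dimensions, toy projection weights, and the decode_step helper are illustrative assumptions, not code from any of the systems surveyed in this review.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 64
# Toy projection weights for one attention head (illustrative only).
W_q = torch.randn(d_model, d_model) / d_model ** 0.5
W_k = torch.randn(d_model, d_model) / d_model ** 0.5
W_v = torch.randn(d_model, d_model) / d_model ** 0.5

def decode_step(x_new, k_cache, v_cache):
    """One decoding step: attend the newest token against the cached prefix.

    x_new:            (1, d_model) embedding of the most recent token.
    k_cache, v_cache: (t, d_model) keys/values stored for the t previous tokens.
    """
    q = x_new @ W_q                                      # only the new token's query
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)   # append, don't recompute
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)
    attn = F.softmax((q @ k_cache.T) / d_model ** 0.5, dim=-1)  # (1, t + 1)
    return attn @ v_cache, k_cache, v_cache              # (1, d_model) output

# Prefill: compute K/V for the whole prompt once, then decode token by token.
prompt = torch.randn(5, d_model)
k_cache, v_cache = prompt @ W_k, prompt @ W_v
x = prompt[-1:]
for _ in range(3):
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
    x = out  # stand-in for the embedding of the next sampled token
```

Each step costs attention over the cached prefix plus one new key/value projection, rather than recomputing K and V for the entire sequence.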

Challenges with Cache Growth: The drawback is that the KV cache grows linearly with sequence length and consumes substantial memory. For every cached token, the model must store a key and a value vector for each transformer layer and attention head. LLaMA-2 13B, for example, needs roughly 1 MB of cache per token, so a 4K-token context occupies 3–4 GB per sequence; with even modest batch sizes the total cache can approach the size of the model weights themselves, and the numbers grow further for larger models. This creates memory-capacity bottlenecks and drives up the memory bandwidth consumed by attention, especially in long-context or long-response workloads.
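The per-token figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes LLaMA-2 13B’s published configuration (40 layers, 40 attention heads, head dimension 128) and 16-bit (2-byte) cache entries; the ~1 MB and ~4 GB figures quoted above round these values up.

```python
# Rough KV cache sizing, assuming LLaMA-2 13B's configuration:
# 40 layers, 40 heads, head_dim 128, fp16 (2 bytes per value).
n_layers, n_heads, head_dim, bytes_per_val = 40, 40, 128, 2

# One key vector and one value vector per layer and head, for every cached token.
per_token = 2 * n_layers * n_heads * head_dim * bytes_per_val
print(f"{per_token / 2**20:.2f} MiB per token")            # ~0.78 MiB, i.e. ~1 MB
print(f"{per_token * 4096 / 2**30:.2f} GiB at 4K tokens")  # ~3.1 GiB per sequence
```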

Recent Research (2024–2025): A surge of research aims to compress or limit KV caches without sacrificing model performance: