
Table of Contents
- KV Caching in LLM Inference: A Comprehensive Review
- Theoretical Advancements in KV Caching
- Practical Implementation in Modern LLMs
- Industry Applications of KV Caching
- KV Caching vs Other Optimization Techniques
- KV Caching in PyTorch, TensorFlow, and JAX
- Performance Benchmarks and Model Comparisons
Key-Value (KV) caching is a technique used in large language model (LLM) inference to store the key and value tensors from previous decoding steps. By reusing these stored tensors for each new token’s attention computation, KV caching avoids redundant calculations and significantly accelerates autoregressive generation. This review covers recent theoretical advancements in KV caching (2024–2025), practical integration strategies in model architectures, real-world enterprise use cases, comparisons with alternative optimizations, framework-specific implementations (PyTorch, TensorFlow, JAX), and performance benchmarks from cutting-edge LLMs like Mistral, DeepSeek, and OpenAI’s latest models.
Theoretical Advancements in KV Caching
Avoiding Redundant Attention Recomputation: In a transformer decoder, generating each new token involves computing self-attention against all prior tokens. KV caching mitigates this by storing past keys and values so that each iteration only computes attention for the latest token’s query. This yields a dramatic speedup: the prompt is processed once with full attention (prefill), and every subsequent token reuses the cached KV pairs, appending just one new key and value per layer. The result is roughly constant per-token latency after the first generated token, greatly improving throughput for long sequences.
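To make the mechanism concrete, here is a minimal PyTorch sketch (a toy single-head attention with made-up dimensions, not any particular library’s API): the cache is simply the running concatenation of past keys and values, and each decoding step attends the new token’s query against it.

```python
import torch
import torch.nn.functional as F

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One decoding step with a KV cache (single head, batch size 1).

    q_new, k_new, v_new: [1, d] projections of the latest token.
    k_cache, v_cache:    [t, d] keys/values of all previous tokens (or None).
    """
    # Append the new key/value to the cache instead of recomputing old ones.
    k_cache = k_new if k_cache is None else torch.cat([k_cache, k_new], dim=0)
    v_cache = v_new if v_cache is None else torch.cat([v_cache, v_new], dim=0)

    # Attention of the single new query against all cached keys/values.
    scores = (q_new @ k_cache.T) / k_cache.shape[-1] ** 0.5  # [1, t+1]
    attn = F.softmax(scores, dim=-1)
    out = attn @ v_cache                                     # [1, d]
    return out, k_cache, v_cache

# Toy usage: 5 decoding steps with d = 8.
d, k_cache, v_cache = 8, None, None
for _ in range(5):
    q, k, v = (torch.randn(1, d) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
print(k_cache.shape)  # torch.Size([5, 8]) -- cache grows by one row per token
```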
Challenges with Cache Growth: The drawback is that the KV cache grows linearly with sequence length, consuming substantial memory. For every token in context, each transformer layer must store one key vector and one value vector per attention head. For example, LLaMA-2 13B needs roughly 1 MB of cache per token in fp16; over a 4K-token context that is roughly 3–4 GB per sequence, a substantial fraction of the model’s own memory footprint, and the total multiplies further with bigger models, longer contexts, or larger batches. This growth leads to memory-capacity bottlenecks and higher memory-bandwidth pressure during attention, especially in long-context or long-response workloads.
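The arithmetic behind these figures is simple to reproduce. The snippet below assumes LLaMA-2 13B shapes (40 layers, hidden size 5120, fp16) and recovers the rough 1 MB/token and multi-GB-per-4K-context estimates:

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, bytes_per_elem=2, batch=1):
    """KV cache size: 2 tensors (K and V) per layer, each hidden_size wide per token."""
    return 2 * n_layers * hidden_size * bytes_per_elem * seq_len * batch

# Assumed LLaMA-2 13B shapes: 40 layers, hidden size 5120, fp16 (2 bytes/element).
per_token = kv_cache_bytes(n_layers=40, hidden_size=5120, seq_len=1)
full_ctx  = kv_cache_bytes(n_layers=40, hidden_size=5120, seq_len=4096)
print(f"{per_token / 2**20:.2f} MiB per token")     # ~0.78 MiB
print(f"{full_ctx / 2**30:.2f} GiB for 4K tokens")  # ~3.1 GiB
```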
Recent Research (2024–2025): A surge of research aims to compress or limit KV caches without sacrificing model performance:
- Constant-Size Caches: MorphKV (2025) introduces an adaptive method to maintain a fixed-size KV cache by selectively retaining the most relevant key/value pairs. Instead of dropping old tokens arbitrarily, MorphKV uses attention patterns to iteratively refine which past tokens to keep, preserving long-range dependencies with minimal accuracy loss. This yields >50% memory savings over prior methods while even improving long-form accuracy in benchmarks.
- Cache Compression: MiniCache (2024) compresses the KV cache across layers by merging adjacent layers’ states. Observing that KV tensors in deeper layers are highly similar, it “disentangles” each state into magnitude and direction, then interpolates the directions between adjacent layers to remove depth-wise redundancy. MiniCache achieved up to 5× cache compression and roughly 5× higher throughput on LLaMA-2 when using 4-bit compressed caches, cutting memory footprint by about 41% with near-lossless performance.
- Selective Retention: SnapKV (2024) takes a fine-tuning-free approach by selecting only the “important” past token positions for each attention head. It finds that each head mainly attends to a subset of prompt features, identifiable via a small observation window. SnapKV clusters and keeps those crucial KV entries and discards the rest, yielding up to 3.6× faster generation and 8.2× lower memory use on 16K-token inputs with negligible accuracy drop. Impressively, SnapKV enabled processing a 380K-token context on a single 80 GB GPU (Qwen-7B model) with only minor quality loss. (A simplified sketch of this score-based retention idea appears after this list.)
- Quantization of KV: AQUA-KV (2024) (“Cache Me If You Must”) dynamically quantizes KV tensors to shrink memory while maintaining accuracy. By adaptively allocating precision based on content, AQUA-KV achieved higher compression rates on LLaMA 3.x models than static quantization schemes, supporting extremely long contexts (aiming for 10M-token inference). Other works such as KVQuant pursue 4-bit or mixed-precision KV caching to reach context lengths that were previously infeasible. (A minimal sketch of per-token KV quantization also follows below.)
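The selective-retention methods above differ in how they score and cluster positions, but the core mechanism can be illustrated simply: rank cached positions by the attention they received from a recent observation window of queries, then evict the lowest-ranked ones. The following is a simplified, hypothetical sketch of that idea (not SnapKV’s or MorphKV’s actual implementation):

```python
import torch

def prune_kv_cache(k_cache, v_cache, attn_weights, budget):
    """Keep only the `budget` cached positions that received the most attention.

    k_cache, v_cache: [seq_len, d]      cached keys/values for one head.
    attn_weights:     [window, seq_len] attention weights from a recent
                      observation window of queries over the cached positions.
    """
    seq_len = k_cache.shape[0]
    if seq_len <= budget:
        return k_cache, v_cache

    # Importance of each cached position = total attention it received
    # from the observation-window queries.
    scores = attn_weights.sum(dim=0)                           # [seq_len]
    keep = torch.topk(scores, k=budget).indices.sort().values  # preserve order

    return k_cache[keep], v_cache[keep]

# Toy usage: shrink a 1024-entry cache for one head down to 256 entries.
d = 64
k_cache, v_cache = torch.randn(1024, d), torch.randn(1024, d)
attn = torch.softmax(torch.randn(32, 1024), dim=-1)  # 32-query observation window
k_small, v_small = prune_kv_cache(k_cache, v_cache, attn, budget=256)
print(k_small.shape)  # torch.Size([256, 64])
```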
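The quantization-based approaches instead reduce the bytes per cached element. Below is a minimal sketch of symmetric per-token int8 quantization of a key (or value) tensor; it is far simpler than AQUA-KV’s adaptive precision allocation or KVQuant’s 4-bit schemes, but it shows where the memory saving comes from.

```python
import torch

def quantize_kv_int8(x):
    """Symmetric per-token int8 quantization: one fp scale per row (token)."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0  # [seq_len, 1]
    scale = scale.clamp(min=1e-8)                       # avoid division by zero
    q = torch.round(x / scale).to(torch.int8)           # [seq_len, d], 1 byte/elem
    return q, scale

def dequantize_kv_int8(q, scale):
    return q.float() * scale

# Toy usage: an fp16 key cache of 4096 tokens x 128 dims shrinks ~2x
# (int8 values plus one small scale per token).
k = torch.randn(4096, 128, dtype=torch.float16)
q, scale = quantize_kv_int8(k.float())
k_hat = dequantize_kv_int8(q, scale)
print((k.float() - k_hat).abs().max())  # small reconstruction error
```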