Large Language Models (LLMs) are powerful, but their performance can be bottlenecked by the large GPU memory footprint of the Key-Value (KV) Cache. This cache, which speeds up LLM inference by storing the Key (K) and Value (V) matrices computed during attention, directly impacts context length, concurrency, and overall system throughput. Our primary goal is to maximize the KV Cache hit ratio by supplementing NVIDIA GPU High Bandwidth Memory (HBM) with a tiered node-local storage solution.

Our collaboration with the LMCache team (Kuntai Du, Jiayi Yao, and Yihua Cheng from Tensormesh) has led to the development of an innovative solution on Google Kubernetes Engine (GKE).

Tiered Storage: Expanding the KV Cache Beyond HBM

LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs. This dramatically increases the total cache size, leading to a higher hit ratio and improved inference performance by keeping more data locally on the accelerator node. For GKE users, this means accommodating models with massive context windows while maintaining excellent performance.
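
To make the tiering concrete, the sketch below shows one way the CPU RAM and local SSD tiers could be configured when vLLM hands KV-cache offloading to LMCache. It is a minimal sketch, not our production setup: the environment-variable names, connector name, model, path, and sizes are illustrative assumptions, so confirm the exact knobs against the LMCache and vLLM documentation for the versions you deploy.

```python
# Sketch: configure LMCache's CPU RAM (Tier 2) and local-SSD (Tier 3) cache tiers,
# then start vLLM with its KV-cache transfers delegated to LMCache.
# All names/values below are illustrative assumptions.
import os

from vllm import LLM
from vllm.config import KVTransferConfig

# Tier 2: host-RAM offload; Tier 3: directory on a local-SSD-backed volume.
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "80"             # GB of CPU RAM for KV chunks
os.environ["LMCACHE_LOCAL_DISK"] = "file:///local_ssd/kv/"  # emptyDir mount backed by local SSD
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "500"           # GB of SSD capacity for KV chunks
os.environ["LMCACHE_CHUNK_SIZE"] = "256"                    # tokens per cached KV chunk

# HBM (Tier 1) stays under vLLM's control; LMCache handles spill and reuse below it.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
    gpu_memory_utilization=0.8,
)
```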

Performance Benchmarking and Results

We designed tests to measure the performance of this tiered KV Cache by configuring workloads to fill each storage layer (HBM, CPU RAM, Local SSD). We benchmarked these configurations across a range of context lengths (1k, 5k, 10k, 50k, and 100k tokens), representing a diverse set of use cases.

Our primary performance indicators were Time to First Token (TTFT), input token throughput, and end-to-end latency. The results highlight the best-performing storage setup for each KV Cache size and the performance improvements achieved.
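
As an illustration of how these metrics can be collected, the sketch below streams a completion from a vLLM OpenAI-compatible endpoint and timestamps the first streamed chunk (TTFT) and the end of the stream (end-to-end latency). This is not our benchmark harness; the URL and model name are placeholders.

```python
# Sketch: measure TTFT and end-to-end latency for one streamed request
# against a vLLM OpenAI-compatible server. URL and model are placeholders.
import time

import requests

def measure_request(prompt: str, url: str = "http://localhost:8000/v1/completions"):
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        "prompt": prompt,
        "max_tokens": 256,
        "stream": True,
    }
    start = time.perf_counter()
    ttft = None
    with requests.post(url, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue  # skip SSE keep-alive blanks
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first streamed token
    e2e = time.perf_counter() - start               # end-to-end latency
    return ttft, e2e
```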

Experiment Setup

We deployed a vLLM server on an A3 Mega machine, using an emptyDir volume backed by local SSD for ephemeral storage.
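
For reference, here is a minimal sketch of the relevant pod-spec pieces expressed with the official Kubernetes Python client; the image, model, GPU count, and mount path are placeholders rather than our exact manifest. On GKE, an emptyDir volume is served from the node's local SSDs when the node pool provisions its ephemeral storage on local SSD.

```python
# Sketch: a vLLM container with an emptyDir volume for the LMCache disk tier,
# built with the Kubernetes Python client. Names and values are placeholders.
from kubernetes import client

cache_volume = client.V1Volume(
    name="kv-cache-disk",
    empty_dir=client.V1EmptyDirVolumeSource(),  # backed by local SSD on the A3 node
)

vllm_container = client.V1Container(
    name="vllm-server",
    image="vllm/vllm-openai:latest",  # assumption: an image with vLLM and LMCache installed
    args=["--model", "meta-llama/Llama-3.1-8B-Instruct"],  # hypothetical model
    volume_mounts=[client.V1VolumeMount(name="kv-cache-disk", mount_path="/local_ssd")],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

pod_spec = client.V1PodSpec(containers=[vllm_container], volumes=[cache_volume])
```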