KV cache
A memory structure used in transformer models to store the key and value tensors produced by self-attention layers. The KV (key-value) cache speeds up inference for transformer models such as large language models (LLMs) by avoiding the need to recompute the key and value tensors for every previous token in the sequence at each generation step.
For example, suppose an LLM is completing the sentence "The quick brown fox..." After the model predicts "jumps" and begins to predict the next token, it needs the key and value tensors for every token in the sequence so far (including the one it just predicted). Without a cache, each step of the autoregressive loop must reprocess the entire sequence thus far, as the sketch after this list illustrates:
- "The quick brown fox..."
- "The quick brown fox jumps..."
- "The quick brown fox jumps over..."
And so on.
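A minimal NumPy sketch of this uncached behavior, using toy dimensions and illustrative weight matrices (`W_k`, `W_v`, `d_model`, `d_head` are assumptions for the example, not part of any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real models use far larger values and many layers/heads.
d_model, d_head = 16, 16
W_k = rng.standard_normal((d_model, d_head))  # key projection
W_v = rng.standard_normal((d_model, d_head))  # value projection

def decode_without_cache(embeddings):
    """Recompute K and V for the *entire* prefix at every generation step."""
    for step in range(1, len(embeddings) + 1):
        prefix = embeddings[:step]   # the prefix grows by one token per step
        K = prefix @ W_k             # (step, d_head) -- all recomputed each time
        V = prefix @ W_v             # (step, d_head) -- all recomputed each time
        # ... attention for the newest token would use K and V here ...
    return K, V

tokens = rng.standard_normal((5, d_model))  # stand-ins for "The quick brown fox jumps"
decode_without_cache(tokens)
```

The work per step grows with the prefix length, so total compute grows quadratically with the number of generated tokens.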
By storing the already-computed key and value tensors for previous tokens in the KV cache, the model simply reads from the cache at each step instead of recomputing them from scratch. Once the model predicts the next token, it computes that token's key and value tensors and appends them to the cache.
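A cached version of the same toy sketch, again with assumed names and toy dimensions. Only the newest token's key and value are computed each step; everything else is read from the cache:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 16
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

def decode_with_cache(embeddings):
    """Compute K and V only for the newest token and append them to the cache."""
    k_cache = np.empty((0, d_head))
    v_cache = np.empty((0, d_head))
    for step in range(len(embeddings)):
        new_token = embeddings[step:step + 1]            # only the latest token
        k_cache = np.vstack([k_cache, new_token @ W_k])  # append its key
        v_cache = np.vstack([v_cache, new_token @ W_v])  # append its value
        # ... attention for the new token's query reads k_cache and v_cache ...
    return k_cache, v_cache

tokens = rng.standard_normal((5, d_model))
decode_with_cache(tokens)
```

In a real transformer this bookkeeping happens per layer and per attention head, but the principle is the same: constant work per new token at the cost of memory that grows with sequence length.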
As the sequence grows during inference (as more tokens are generated), the KV cache becomes the dominant factor in the model's memory usage. The sequence length is bounded by the model's context window, which varies across models; inference frameworks usually let you cap the maximum sequence length they allocate cache for.
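A rough back-of-the-envelope estimate of cache size: two tensors (K and V) per layer, per KV head, per token. The dimensions below are assumptions chosen to resemble a 7B-parameter model (32 layers, 32 KV heads, head dimension 128, FP16 values), not figures from the source:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size: 2 tensors (K and V) per layer, token, and head."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed 7B-class dimensions, one 4096-token sequence in FP16.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # -> 2.0 GiB for a single sequence
```

Because the estimate scales linearly with both sequence length and batch size, serving many long sequences concurrently can make the KV cache larger than the model weights themselves.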