Prefix caching with PagedAttention

Prefix caching is a technique that caches the key-value (KV) cache of existing inference requests so that new queries can reuse the context encoded in the KV cache if they share the same prefix. This eliminates redundant computations and improves performance for workloads with repeated prefixes.
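
To make the mechanism concrete, the following minimal Python sketch shows one way hash-based page reuse can work, assuming a paged cache keyed by cumulative prefix hashes. It illustrates the general technique only, not MAX's implementation, and compute_kv is a hypothetical stand-in for the real prefill computation.

PAGE_SIZE = 128  # tokens per KV cache page

def compute_kv(page_tokens):
    """Placeholder for running attention prefill over one page of tokens."""
    return ("kv", tuple(page_tokens))

cache = {}  # cumulative prefix hash -> cached KV page

def get_kv_pages(tokens):
    pages = []
    prefix_hash = 0
    for start in range(0, len(tokens), PAGE_SIZE):
        page = tokens[start:start + PAGE_SIZE]
        # Chain the hash so a page is reused only when the entire prefix
        # before it matches, not merely this page's own tokens.
        prefix_hash = hash((prefix_hash, tuple(page)))
        if prefix_hash in cache:
            pages.append(cache[prefix_hash])  # hit: skip recomputation
        else:
            cache[prefix_hash] = compute_kv(page)  # miss: run prefill
            pages.append(cache[prefix_hash])
    return pages

Because each page's key chains in the hash of everything before it, two requests share cached pages exactly as far as their token prefixes match.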

By default, prefix caching is disabled in MAX. It can be enabled using the --enable-prefix-caching flag.

Enable prefix caching

To enable prefix caching, run the following command. You'll need our magic package manager and the max-pipelines CLI tool.

max-pipelines serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --cache-strategy paged \
  --enable-prefix-caching \
  --kv-cache-page-size 128 \
  --quantization-encoding float32

Configuration options

Prefix caching with PagedAttention works on both CPU and GPU. To deploy a model with prefix caching using MAX Serve, pass --devices cpu for CPU workloads or --devices gpu for GPU workloads. If no flag is provided, the model runs on the first available GPU, or on the first available CPU if no GPUs are available.

  • --cache-strategy: Prefix caching requires PagedAttention. To use PagedAttention, set your cache strategy to paged.
  • --enable-prefix-caching: Enables prefix caching.
  • --kv-cache-page-size: PagedAttention currently requires a page size that is a multiple of 128.
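
For example, to run the same deployment explicitly on a GPU, add --devices gpu to the serve command shown earlier:

max-pipelines serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --cache-strategy paged \
  --enable-prefix-caching \
  --kv-cache-page-size 128 \
  --quantization-encoding float32 \
  --devices gpu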

When to use prefix caching

Prefix caching speeds up the pre-fill stage of inference, which reduces time to first token (TTFT). It can also reduce KV cache memory usage across requests, making room to schedule larger batches and yielding higher throughput.
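
As an illustration, with the page size of 128 used above, a 1,024-token system prompt shared by 16 concurrent requests occupies 1,024 / 128 = 8 KV cache pages once, rather than 8 × 16 = 128 pages. The numbers are illustrative, but the reclaimed memory is what lets the scheduler admit larger batches.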

Prefix caching can provide significant performance improvements in the following scenarios:

  1. Similar queries: When a user repeatedly issues queries that share the same system prompt, the KV cache for that prefix can be computed once and reused, eliminating redundant computation (see the example after this list).
  2. Multi-round conversations: In chat applications, users often ask follow-up questions related to previous inputs. Although the server normally releases KV cache memory after each request, prefix caching preserves the computation from earlier conversation turns so follow-up requests can reuse it, without requiring an explicit session.
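
As a sketch of the first scenario, the two requests below share an identical system prompt, so the second request can reuse the KV cache pages computed for that prefix. This assumes the server started above is listening on the default port 8000 and serving the OpenAI-compatible chat endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Explain Python list comprehensions."}
    ]
  }'

# Same system prompt (shared prefix), new question: the pre-fill work for
# the system prompt is served from the prefix cache.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Explain Python generators."}
    ]
  }'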

Prefix caching won't degrade performance, but it also provides no additional benefit in the following cases:

  • Unique queries: If new queries do not share prefixes with previous queries, there is no opportunity to reuse cached KV values, making prefix caching ineffective.
  • Long response generation: Prefix caching only speeds up the pre-fill phase of a request. If most of the time is spent generating new tokens (decoding), caching will have little impact.

Next steps

Now that you know the basics of prefix caching and PagedAttention, you can get started with MAX Serve on GPUs.

MAX also includes a benchmarking script for evaluating throughput, latency, and GPU utilization, which you can use to measure the performance gains from prefix caching. For detailed instructions, see Benchmark MAX Serve.