Prefix caching with PagedAttention
Prefix caching is a technique that stores the key-value (KV) cache of existing inference requests so that new queries that share the same prefix can reuse the context already encoded in it. This eliminates redundant computation and improves performance for workloads with repeated prefixes.
Prefix caching is enabled by default when serving a model with the max serve CLI command. It can be disabled using the
--no-enable-prefix-caching flag.
When to use prefix caching
Prefix caching speeds up the pre-fill stage of inference, which reduces time to first token (TTFT). It can also reduce memory usage within the KV cache for all requests, which makes room for scheduling larger batches and yields higher throughput.
Prefix caching can provide significant performance improvements in the following scenarios:
- Similar queries: When a user repeatedly makes similar queries that use the same system prompt instructions, the KV cache of the prefix can be stored in advance to reduce redundant computation.
- Multi-round conversations: In chat applications, users often ask follow-up queries related to previous inputs. Because the server otherwise releases KV cache memory after each request, prefix caching preserves computation from past conversation turns without requiring an explicit session.
Prefix caching won't result in performance degradation. However, it also does not provide additional benefit in the following cases:
- Unique queries: If new queries do not share prefixes with previous queries, there is no opportunity to reuse cached KV values, making prefix caching ineffective.
- Long response generation: Prefix caching only speeds up the pre-fill phase of a request. If most of the time is spent generating new tokens (decoding), caching will have little impact.
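As a rough way to reason about the cases above, the following Python sketch estimates what fraction of a new prompt's tokens match the start of an earlier prompt. The function names and token IDs are hypothetical and purely illustrative. The sketch compares prompts token by token, which is a simplification and not a description of how MAX matches prefixes.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_fraction(history: List[Sequence[int]], prompt: Sequence[int]) -> float:
    """Fraction of `prompt` covered by the longest prefix shared with any earlier prompt."""
    if not prompt:
        return 0.0
    best = max((shared_prefix_len(p, prompt) for p in history), default=0)
    return best / len(prompt)


# Made-up token IDs, for illustration only.
system = [101, 7, 7, 42]      # shared system prompt
turn_1 = system + [5, 6]      # first user turn
turn_2 = turn_1 + [8, 9]      # follow-up that extends the conversation
unrelated = [900, 901, 902]   # query with no shared prefix

print(reusable_fraction([turn_1], turn_2))     # 0.75
print(reusable_fraction([turn_1], unrelated))  # 0.0
```

A value near 1.0 suggests most of the pre-fill work could be skipped, while a value of 0.0 corresponds to the unique-queries case. The estimate says nothing about decode time, so a request dominated by long generation still sees little change.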
How prefix caching works
Prefix caching works by storing the key-value (KV) cache for a prefix and applying it to future prompts that include the same prefix, reducing redundant computation. It's built on PagedAttention, which MAX uses by default to manage the KV cache. The following flags control prefix caching behavior:
- --enable-prefix-caching: enables prefix caching. This is on by default, so you only need it to re-enable caching after disabling it. Use --no-enable-prefix-caching to disable it.
- --kv-cache-page-size: sets the number of tokens per page in the paged KV cache. PagedAttention requires a page size that is a multiple of 128.
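To make the idea of page-based reuse concrete, here is a minimal Python sketch of a prefix index that works at page granularity. It is not MAX's implementation. The class name, the chained hashing scheme, and the rule that only full pages are reused are all assumptions made for illustration. The demo uses a page size of 4 for readability, whereas MAX requires a multiple of 128.

```python
import hashlib
from typing import Dict, List, Sequence


class ToyPrefixCache:
    """Toy page-granular prefix index (illustrative only, not MAX's implementation)."""

    def __init__(self, page_size: int) -> None:
        if page_size <= 0:
            raise ValueError("page_size must be positive")
        self.page_size = page_size
        # Chained page hash -> hypothetical page ID.
        self._pages: Dict[bytes, int] = {}

    def _page_keys(self, tokens: Sequence[int]) -> List[bytes]:
        """Hash each full page together with the hash of the page before it."""
        keys: List[bytes] = []
        parent = b""
        full = len(tokens) - len(tokens) % self.page_size
        for start in range(0, full, self.page_size):
            page = list(tokens[start:start + self.page_size])
            parent = hashlib.sha256(parent + repr(page).encode()).digest()
            keys.append(parent)
        return keys

    def insert(self, tokens: Sequence[int]) -> None:
        """Record every full page of a processed prompt."""
        for key in self._page_keys(tokens):
            self._pages.setdefault(key, len(self._pages))

    def match(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens are covered by cached pages."""
        hits = 0
        for key in self._page_keys(tokens):
            if key not in self._pages:
                break
            hits += 1
        return hits * self.page_size


cache = ToyPrefixCache(page_size=4)       # toy page size for the demo
cache.insert(list(range(10)))             # caches pages [0..3] and [4..7]

print(cache.match(list(range(8)) + [99, 100]))       # 8
print(cache.match([0, 1, 2, 3, 50, 51, 52, 53]))     # 4
print(cache.match([0, 1, 2]))                        # 0
```

Because each page key includes the hash of the preceding page, a page only matches when everything before it also matches, which is what makes this a prefix lookup and not a general substring lookup. Under this sketch's assumptions, a partially filled trailing page is never reused.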
Prefix caching with PagedAttention works on both CPU and GPU. To deploy a model
with prefix caching using the max CLI, you can use the flag --devices cpu
for CPU or --devices gpu for GPU workloads. If you omit --devices, MAX uses
the model or config default.
Quickstart
Prefix caching is enabled by default when serving a model with MAX. To install
the max CLI, see the installation guide.
The following command serves Gemma 3 with prefix caching enabled. The default
KV cache page size for Gemma 3 is 256.
max serve --model google/gemma-3-27b-it
Disable prefix caching
To disable prefix caching when serving a model, use the following command:
max serve --model google/gemma-3-27b-it --no-enable-prefix-caching
Enable prefix caching
You can explicitly enable prefix caching when serving your model with the
max CLI. The following command represents the default MAX
behavior:
max serve --model google/gemma-3-27b-it \
--enable-prefix-caching \
--kv-cache-page-size 256
Next steps
Now that you know the basics of prefix caching and PagedAttention, you can get started with MAX on GPUs.
MAX also includes a benchmarking script that allows you to evaluate throughput,
latency, and GPU utilization metrics. You can use this script to track
performance gains from prefix caching. For more detailed instructions on
benchmarking, see the
max benchmark docs.
- Speculative decoding: Use speculative decoding to accelerate LLM inference.
- Benchmark MAX on NVIDIA or AMD GPUs: Learn how to use our benchmarking script to measure the performance of MAX.
- Deploy MAX on GPU with self-hosted endpoints: Learn how to deploy MAX pipelines to the cloud.