
Prefix caching with PagedAttention

Prefix caching stores the key-value (KV) cache computed for earlier inference requests so that a new request sharing the same prefix can reuse it instead of recomputing it. This eliminates redundant computation and improves performance for workloads with repeated prefixes.

Prefix caching is enabled by default when serving a model with the max serve CLI command. It can be disabled using the --no-enable-prefix-caching flag.

When to use prefix caching

Prefix caching speeds up the pre-fill stage of inference, which reduces time to first token (TTFT). It can also reduce KV cache memory usage across requests, which makes room to schedule larger batches and achieve higher throughput.

Prefix caching can provide significant performance improvements in the following scenarios:

Similar queries: When a user repeatedly makes similar queries that use the same system prompt instructions, the KV cache of the prefix can be stored in advance to reduce redundant computation.
Multi-round conversations: In chat applications, users often ask follow-up queries related to previous inputs. Without prefix caching, the server releases a request's KV cache memory once the request completes. With prefix caching, the computation from past conversation turns is preserved without requiring an explicit session.
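Both scenarios depend on prompts being laid out so that the shared content comes first. The following is a minimal sketch of that layout, not MAX code: the system prompt, the `build_prompt` helper, and the flat "role: text" format are hypothetical, and characters stand in for tokens (a real server applies a chat template and compares token IDs).

```python
def common_prefix_len(a, b):
    """Count how many leading characters two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical system prompt reused by every request.
SYSTEM = "You are a concise assistant for the Acme help desk."


def build_prompt(history, user_msg):
    """Flatten a chat into one string: system prompt first, then turns in order."""
    lines = [f"system: {SYSTEM}"]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"user: {user_msg}")
    return "\n".join(lines)


# Multi-round conversation: the second turn's prompt extends the first one.
history = []
turn1 = build_prompt(history, "How do I reset my password?")
history += [
    ("user", "How do I reset my password?"),
    ("assistant", "Use the reset link."),
]
turn2 = build_prompt(history, "What if the link expired?")

# Similar query: a different conversation that shares only the system prompt.
other = build_prompt([], "Where is my invoice?")

print(turn2.startswith(turn1))          # the whole first turn is a reusable prefix
print(common_prefix_len(turn1, other))  # the shared system prompt portion
```

Keeping per-request content (user input, timestamps, retrieved documents) after the shared instructions is what lets the prefix stay identical across requests.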

Prefix caching won't degrade performance, but it provides no additional benefit in the following cases:

Unique queries: If new queries do not share prefixes with previous queries, there is no opportunity to reuse cached KV values, making prefix caching ineffective.
Long response generation: Prefix caching only speeds up the pre-fill phase of a request. If most of the time is spent generating new tokens (decoding), caching will have little impact.
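The long-response case can be illustrated with a back-of-the-envelope estimate. This sketch assumes a deliberately simplified model in which pre-fill time shrinks in proportion to the fraction of prompt tokens served from cache and decode time is unaffected. The timings are made-up inputs, not measurements of MAX.

```python
def estimated_speedup(prefill_s, decode_s, cached_fraction):
    """Estimate end-to-end speedup when part of the prompt is served from cache.

    Simplifying assumption: pre-fill time scales linearly with the number of
    uncached prompt tokens, and decode time does not change.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if prefill_s < 0 or decode_s <= 0:
        raise ValueError("prefill_s must be >= 0 and decode_s must be > 0")
    before = prefill_s + decode_s
    after = prefill_s * (1.0 - cached_fraction) + decode_s
    return before / after


# Hypothetical short answer: pre-fill dominates, so caching helps a lot.
short_reply = estimated_speedup(prefill_s=0.8, decode_s=0.2, cached_fraction=0.9)

# Hypothetical long answer: decode dominates, so the same cache hit barely matters.
long_reply = estimated_speedup(prefill_s=0.8, decode_s=8.0, cached_fraction=0.9)

print(f"short reply: {short_reply:.2f}x, long reply: {long_reply:.2f}x")
```

Under these assumed numbers the same 90% cache hit yields roughly a 3.6x end-to-end speedup for the short reply and under 1.1x for the long one, although TTFT improves equally in both cases.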

How prefix caching works

Prefix caching works by storing the key-value (KV) cache for a prefix and applying it to future prompts that include the same prefix, reducing redundant computation. It's built on PagedAttention, which MAX uses by default to manage the KV cache. The following flags control prefix caching behavior:

--enable-prefix-caching: enables prefix caching. This is on by default, so you only need it to re-enable caching after disabling it. Use --no-enable-prefix-caching to disable it.
--kv-cache-page-size: sets the number of tokens per page in the paged KV cache. PagedAttention requires a page size that is a multiple of 128.
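The page-based reuse that these flags control can be illustrated with a toy sketch. This is not MAX's implementation: the `ToyPrefixCache` class, the chained SHA-256 page keys, and the rule that only complete pages are matched are assumptions chosen to show the general idea, and the page size of 4 is purely for readability.

```python
import hashlib

PAGE_SIZE = 4  # toy value for readability; real page sizes are much larger


def page_keys(tokens, page_size=PAGE_SIZE):
    """Return one key per full page; each key also covers all earlier pages."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % page_size  # ignore the partial last page
    for start in range(0, full, page_size):
        page = tokens[start:start + page_size]
        digest = hashlib.sha256(parent + repr(page).encode()).digest()
        keys.append(digest)
        parent = digest  # chaining makes a key depend on the whole prefix
    return keys


class ToyPrefixCache:
    def __init__(self):
        self.pages = {}  # page key -> placeholder for that page's KV data

    def insert(self, tokens):
        """Record every full page of a processed prompt."""
        for i, key in enumerate(page_keys(tokens)):
            self.pages.setdefault(key, f"kv-page-{i}")

    def match(self, tokens):
        """Return how many leading tokens already have cached pages."""
        hit = 0
        for key in page_keys(tokens):
            if key not in self.pages:
                break
            hit += PAGE_SIZE
        return hit


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # stores 2 full pages

# Same first 8 tokens, different continuation: both cached pages are reused.
reused = cache.match([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45])

# The first token differs, so no page matches, even though tokens 5-8 are equal.
missed = cache.match([9, 2, 3, 4, 5, 6, 7, 8])

print(reused, missed)
```

Chaining each page key to its parent is one way to ensure a page is reused only when everything before it is identical, which is what makes it a prefix match rather than a substring match.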

Prefix caching with PagedAttention works on both CPU and GPU. To deploy a model with prefix caching using the max CLI, you can use the flag --devices cpu for CPU or --devices gpu for GPU workloads. If you omit --devices, MAX uses the model or config default.

Quickstart

Prefix caching is enabled by default when serving a model with MAX. To install the max CLI, see the installation guide.

The following command serves Gemma 3 with prefix caching enabled. The default KV cache page size for Gemma 3 is 256.

max serve --model google/gemma-3-27b-it

MAX does not support prefix caching for multimodal models.

If you explicitly enable prefix caching for an incompatible model, you will see a log similar to the following (referencing KVCacheConfig):

WARNING: Architecture 'MllamaForConditionalGeneration' requires \
KVCacheConfig.enable_prefix_caching=False, overriding current value True

In this case, MAX automatically disables prefix caching, as if you had passed --no-enable-prefix-caching.

Disable prefix caching

To disable prefix caching when serving a model, use the following command:

max serve --model google/gemma-3-27b-it --no-enable-prefix-caching

Enable prefix caching

You can explicitly enable prefix caching when serving your model with the max CLI. The following command represents the default MAX behavior:

max serve --model google/gemma-3-27b-it \
    --enable-prefix-caching \
    --kv-cache-page-size 256
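When picking a value for --kv-cache-page-size, it can help to check the multiple-of-128 requirement and see how a prompt divides into pages. The helpers below are hypothetical and not part of the max CLI. The sketch also assumes that only complete pages can be reused across requests, which is common in paged designs but is not stated here for MAX.

```python
def validate_page_size(page_size, multiple=128):
    """Raise if page_size is not a positive multiple of `multiple`."""
    if page_size <= 0 or page_size % multiple != 0:
        raise ValueError(f"page size {page_size} is not a multiple of {multiple}")
    return page_size


def pages_for(num_tokens, page_size):
    """Return (full_pages, leftover_tokens) for a prompt of num_tokens."""
    validate_page_size(page_size)
    return divmod(num_tokens, page_size)


# A hypothetical 1000-token shared prefix under two valid page sizes.
full_256, rest_256 = pages_for(1000, 256)
full_128, rest_128 = pages_for(1000, 128)

print(f"page size 256: {full_256} full pages, {rest_256} tokens left over")
print(f"page size 128: {full_128} full pages, {rest_128} tokens left over")
```

Under the complete-page assumption, the leftover tokens are the part of the prefix that would be recomputed, so a smaller page size gives finer-grained reuse at the cost of tracking more pages.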

Next steps

Now that you know the basics of prefix caching and PagedAttention, you can get started with MAX on GPUs.

MAX also includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. You can use this script to track performance gains from prefix caching. For more detailed instructions on benchmarking, see the max benchmark docs.

Speculative decoding: Use speculative decoding to accelerate LLM inference.
Benchmark MAX on NVIDIA or AMD GPUs: Learn how to use our benchmarking script to measure the performance of MAX.
Deploy MAX on GPU with self-hosted endpoints: Learn how to deploy MAX pipelines to the cloud.
