Prefix caching with PagedAttention
Prefix caching is a technique that caches the key-value (KV) cache of existing inference requests so that new queries can reuse the context encoded in the KV cache if they share the same prefix. This eliminates redundant computations and improves performance for workloads with repeated prefixes.
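To build intuition for how reuse works, here is a minimal, self-contained sketch of one common design (hash-based block matching; illustrative only, not MAX's actual implementation): prompts are split into fixed-size token blocks, each block is keyed by a hash of the entire prefix up to and including it, and requests that share a prefix hit the same keys.

```python
# Illustrative prefix-cache lookup, NOT MAX internals.
import hashlib

BLOCK_SIZE = 128  # tokens per KV cache block

def block_keys(token_ids: list[int]) -> list[str]:
    """One key per full block; each key covers the whole prefix so far."""
    keys = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        keys.append(hashlib.sha256(str(token_ids[:end]).encode()).hexdigest())
    return keys

kv_cache: dict[str, str] = {}  # key -> cached KV block (placeholder)

def prefill(token_ids: list[int]) -> int:
    """Return how many blocks were served from the cache."""
    hits = 0
    for key in block_keys(token_ids):
        if key in kv_cache:
            hits += 1              # reuse KV values from an earlier request
        else:
            kv_cache[key] = "KV"   # compute and store this block's KV values
    return hits

shared_system_prompt = list(range(256))           # two full shared blocks
print(prefill(shared_system_prompt + [1] * 150))  # 0 hits: nothing cached yet
print(prefill(shared_system_prompt + [2] * 150))  # 2 hits: shared prefix reused
```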
By default, prefix caching is disabled in MAX. It can be enabled using the `--enable-prefix-caching` flag.
Enable prefix caching
To enable prefix caching, run the following command. You'll also need our `magic` package manager and the `max-pipelines` CLI tool.
```sh
max-pipelines serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --cache-strategy paged \
  --enable-prefix-caching \
  --kv-cache-page-size 128 \
  --quantization-encoding float32
```
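Once the server is running, you can exercise prefix caching from any OpenAI-compatible client. Below is a hedged sketch that assumes the server listens at http://localhost:8000/v1 (adjust the base URL and port for your deployment). Both requests share the same long system prompt, so the second request's prefill can reuse the cached KV blocks for that prefix.

```python
# Two requests with an identical system prompt; the shared prefix is the
# part that prefix caching can reuse. The endpoint URL is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "You are a helpful assistant. Answer briefly. " * 40  # long shared prefix

for question in ["What is PagedAttention?", "What is a KV cache?"]:
    response = client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical both times
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)
```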
Configuration options
Prefix caching with PagedAttention works on both CPU and GPU. To deploy a model with prefix caching using MAX Serve, use the `--devices cpu` flag for CPU workloads or `--devices gpu` for GPU workloads. If no flag is provided, the model runs on the first available GPU, or on the first available CPU if no GPUs are available.
- `--cache-strategy`: Prefix caching requires PagedAttention. To use PagedAttention, set your cache strategy to `paged`.
- `--enable-prefix-caching`: Enables prefix caching.
- `--kv-cache-page-size`: PagedAttention currently requires a page size that is a multiple of 128 (see the sketch after this list).
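To make the page-size constraint concrete, here's a small back-of-the-envelope sketch (assuming the 128-token page size from the command above): a prompt occupies one page per 128 tokens, and only pages fully covered by a shared prefix can be reused.

```python
# Paging arithmetic for a 128-token page size (illustrative only).
import math

PAGE_SIZE = 128  # must be a multiple of 128, per --kv-cache-page-size

def pages_needed(prompt_tokens: int) -> int:
    """Pages a prompt occupies in the paged KV cache."""
    return math.ceil(prompt_tokens / PAGE_SIZE)

def reusable_pages(shared_prefix_tokens: int) -> int:
    """Only whole pages inside the shared prefix can be reused."""
    return shared_prefix_tokens // PAGE_SIZE

print(pages_needed(1000))     # 8 pages for a 1000-token prompt
print(reusable_pages(1000))   # 7 whole pages reusable from a 1000-token prefix
```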
When to use prefix caching
Prefix caching speeds up the prefill stage of inference, which reduces time to first token (TTFT). It can also reduce KV cache memory usage across requests, which frees room to schedule larger batches and yields higher throughput.
Prefix caching can provide significant performance improvements in the following scenarios:
- Similar queries: When a user repeatedly makes similar queries that use the same system prompt instructions, the KV cache of the prefix can be stored in advance to reduce redundant computation.
- Multi-round conversations: In chat applications, users often ask follow-up questions that build on previous turns. The server releases a request's KV cache memory after each request, so without prefix caching each new turn recomputes the full history; with prefix caching, the KV values from past conversation turns are preserved and reused without requiring an explicit session, as sketched below.
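For multi-round conversations, each turn resends the growing history, so everything before the newest user message is an identical prefix. A hedged sketch against the same assumed OpenAI-compatible endpoint:

```python
# Illustrative multi-turn loop (assumes the server from above at
# http://localhost:8000/v1). Each request's message history repeats the
# previous turns verbatim, so that prefix can be served from the cache.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "system", "content": "You are a concise assistant."}]

for user_turn in ["Summarize PagedAttention.", "How does it help batching?"]:
    messages.append({"role": "user", "content": user_turn})
    reply = client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        messages=messages,  # the history so far is a cache-hit prefix
    )
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(answer)
```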
Prefix caching won't degrade performance. However, it provides no additional benefit in the following cases:
- Unique queries: If new queries do not share prefixes with previous queries, there is no opportunity to reuse cached KV values, making prefix caching ineffective.
- Long response generation: Prefix caching only speeds up the prefill phase of a request. If most of the time is spent generating new tokens (decoding), caching will have little impact. The TTFT probe sketched after this list is one way to check which regime you're in.
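One way to see where caching does and doesn't help is to time the first token directly. The probe below is an illustrative sketch (not the MAX benchmarking script) that streams a completion and measures TTFT; with prefix caching enabled, a second run of the same long prompt should show a noticeably lower number, while decode-heavy workloads will change little.

```python
# Rough TTFT probe against the assumed endpoint at http://localhost:8000/v1.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "Explain the history of GPUs. " * 100  # long prompt: prefill-heavy

def time_to_first_token() -> float:
    start = time.perf_counter()
    stream = client.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        prompt=PROMPT,
        max_tokens=16,
        stream=True,
    )
    for _chunk in stream:  # the first streamed chunk carries the first token
        return time.perf_counter() - start
    raise RuntimeError("no tokens were streamed")

print(f"cold TTFT: {time_to_first_token():.3f}s")  # prefix not yet cached
print(f"warm TTFT: {time_to_first_token():.3f}s")  # shared prefix cached
```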
Next steps
Now that you know the basics of prefix caching and PagedAttention, you can get started with MAX Serve on GPUs.
MAX also includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. You can use this script to track performance gains from prefix caching. For more detailed instructions on benchmarking, please see Benchmark MAX Serve.