Prefix caching with PagedAttention
Prefix caching is a technique that caches the key-value (KV) cache of existing inference requests so that new queries can reuse the context encoded in the KV cache if they share the same prefix. This eliminates redundant computations and improves performance for workloads with repeated prefixes.
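To build intuition for how reuse works, here is a minimal, self-contained sketch of one common design (hash-based block matching; illustrative only, not MAX's actual implementation): prompts are split into fixed-size token blocks, each block is keyed by a hash of the entire prefix up to and including it, and requests that share a prefix hit the same keys.

```python
# Illustrative prefix-cache lookup, NOT MAX internals.
import hashlib

BLOCK_SIZE = 128  # tokens per KV cache block

def block_keys(token_ids: list[int]) -> list[str]:
    """One key per full block; each key covers the whole prefix so far."""
    keys = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        keys.append(hashlib.sha256(str(token_ids[:end]).encode()).hexdigest())
    return keys

kv_cache: dict[str, str] = {}  # key -> cached KV block (placeholder)

def prefill(token_ids: list[int]) -> int:
    """Return how many blocks were served from the cache."""
    hits = 0
    for key in block_keys(token_ids):
        if key in kv_cache:
            hits += 1              # reuse KV values from an earlier request
        else:
            kv_cache[key] = "KV"   # compute and store this block's KV values
    return hits

shared_system_prompt = list(range(256))           # two full shared blocks
print(prefill(shared_system_prompt + [1] * 150))  # 0 hits: nothing cached yet
print(prefill(shared_system_prompt + [2] * 150))  # 2 hits: shared prefix reused
```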
By default, prefix caching is disabled in MAX. It can be enabled using the `--enable-prefix-caching` flag.
Enable prefix caching
To enable prefix caching, run the following command. You'll also need our `magic` package manager and the `max-pipelines` CLI tool.
```sh
max-pipelines serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --cache-strategy paged \
  --enable-prefix-caching \
  --kv-cache-page-size 128 \
  --quantization-encoding float32
```
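Once the server is running, you can exercise prefix caching from any OpenAI-compatible client. Below is a hedged sketch that assumes the server listens at http://localhost:8000/v1 (adjust the base URL and port for your deployment). Both requests share the same long system prompt, so the second request's prefill can reuse the cached KV blocks for that prefix.

```python
# Two requests with an identical system prompt; the shared prefix is the
# part that prefix caching can reuse. The endpoint URL is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "You are a helpful assistant. Answer briefly. " * 40  # long shared prefix

for question in ["What is PagedAttention?", "What is a KV cache?"]:
    response = client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical both times
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)
```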
Configuration options
Prefix caching with PagedAttention works on both CPU and GPU. To deploy a model with prefix caching using MAX Serve, use the `--devices cpu` flag for CPU workloads or `--devices gpu` for GPU workloads. If no flag is provided, the model runs on the first available GPU, or on the first available CPU if no GPUs are available.
- `--cache-strategy`: Prefix caching requires PagedAttention. To use PagedAttention, set your cache strategy to `paged`.
- `--enable-prefix-caching`: Enables prefix caching.
- `--kv-cache-page-size`: PagedAttention currently requires a page size that is a multiple of 128 (see the sketch after this list).
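To make the page-size constraint concrete, here's a small back-of-the-envelope sketch (assuming the 128-token page size from the command above): a prompt occupies one page per 128 tokens, and only pages fully covered by a shared prefix can be reused.

```python
# Paging arithmetic for a 128-token page size (illustrative only).
import math

PAGE_SIZE = 128  # must be a multiple of 128, per --kv-cache-page-size

def pages_needed(prompt_tokens: int) -> int:
    """Pages a prompt occupies in the paged KV cache."""
    return math.ceil(prompt_tokens / PAGE_SIZE)

def reusable_pages(shared_prefix_tokens: int) -> int:
    """Only whole pages inside the shared prefix can be reused."""
    return shared_prefix_tokens // PAGE_SIZE

print(pages_needed(1000))     # 8 pages for a 1000-token prompt
print(reusable_pages(1000))   # 7 whole pages reusable from a 1000-token prefix
```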
When to use prefix caching
Prefix caching speeds up the prefill stage of inference, which reduces time to first token (TTFT). It can also reduce KV cache memory usage across requests, which frees room to schedule larger batches and yields higher throughput.
Prefix caching can provide significant performance improvements in the following scenarios:
- Similar queries: When a user repeatedly makes similar queries that use the same system prompt instructions, the KV cache of the prefix can be stored in advance to reduce redundant computation.
- Multi-round conversations: In chat applications, users often ask follow-up questions that build on previous turns. The server releases a request's KV cache memory after each request, so without prefix caching each new turn recomputes the full history; with prefix caching, the KV values from past conversation turns are preserved and reused without requiring an explicit session, as sketched below.
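For multi-round conversations, each turn resends the growing history, so everything before the newest user message is an identical prefix. A hedged sketch against the same assumed OpenAI-compatible endpoint:

```python
# Illustrative multi-turn loop (assumes the server from above at
# http://localhost:8000/v1). Each request's message history repeats the
# previous turns verbatim, so that prefix can be served from the cache.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "system", "content": "You are a concise assistant."}]

for user_turn in ["Summarize PagedAttention.", "How does it help batching?"]:
    messages.append({"role": "user", "content": user_turn})
    reply = client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        messages=messages,  # the history so far is a cache-hit prefix
    )
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(answer)
```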
Prefix caching won't degrade performance. However, it provides no additional benefit in the following cases:
- Unique queries: If new queries do not share prefixes with previous queries, there is no opportunity to reuse cached KV values, making prefix caching ineffective.
- Long response generation: Prefix caching only speeds up the prefill phase of a request. If most of the time is spent generating new tokens (decoding), caching will have little impact. The TTFT probe sketched after this list is one way to check which regime you're in.
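One way to see where caching does and doesn't help is to time the first token directly. The probe below is an illustrative sketch (not the MAX benchmarking script) that streams a completion and measures TTFT; with prefix caching enabled, a second run of the same long prompt should show a noticeably lower number, while decode-heavy workloads will change little.

```python
# Rough TTFT probe against the assumed endpoint at http://localhost:8000/v1.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "Explain the history of GPUs. " * 100  # long prompt: prefill-heavy

def time_to_first_token() -> float:
    start = time.perf_counter()
    stream = client.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        prompt=PROMPT,
        max_tokens=16,
        stream=True,
    )
    for _chunk in stream:  # the first streamed chunk carries the first token
        return time.perf_counter() - start
    raise RuntimeError("no tokens were streamed")

print(f"cold TTFT: {time_to_first_token():.3f}s")  # prefix not yet cached
print(f"warm TTFT: {time_to_first_token():.3f}s")  # shared prefix cached
```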
Next steps
Now that you know the basics of prefix caching and PagedAttention, you can get started with MAX Serve on GPUs.
MAX also includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. You can use this script to track performance gains from prefix caching. For more detailed instructions on benchmarking, please see Benchmark MAX Serve.