Python class
PagedKVCacheManager
class max.pipelines.kv_cache.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False, *, max_batch_size)
Bases: object
Paged KVCache manager with data and tensor parallelism support.
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)
# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0)
kv_manager.alloc(ctx2, replica_idx=1)
# Get KVCache inputs to feed to graph
kv_cache_inputs = kv_manager.runtime_inputs([[ctx1], [ctx2]])
# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)
# Commit newly written blocks to prefix cache
kv_manager.step([[ctx1], [ctx2]])
# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)
Initialize the multi-device paged KV cache manager.
-
Parameters:
-
- params (KVCacheParamInterface) – KV cache parameters. Pass MultiKVCacheParams for models with more than one KV cache.
- session (InferenceSession) – The MAX Engine inference session.
- total_num_pages (int) – The total number of pages to allocate.
- total_num_host_pages (int) – The total number of host pages to allocate.
- max_batch_size (int) – Maximum runtime batch size, used to preallocate per-replica row capacity for the runtime lookup table and cache lengths.
- enable_runtime_checks (bool) – Whether to enable runtime checks.
alloc()
alloc(data, replica_idx)
Allocates blocks for a request.
When prefix caching is enabled, some of the allocated blocks may be retrieved from the prefix cache and the context's active token window is advanced accordingly.
-
Parameters:
-
- data (TextContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim.
- replica_idx (int) – Index of the replica to allocate on.
-
Raises:
-
- InsufficientBlocksError – If there are insufficient free blocks to satisfy the allocation.
-
Return type:
-
None
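The claim-then-allocate sequence can be wrapped so that a failed allocation does not leave a claimed request behind. The following is a minimal sketch, not part of the library: `try_admit` is a hypothetical helper, the manager is duck-typed, and the exception type is passed in by the caller because its import path is not given on this page. It also assumes that calling `release` is valid for a request whose `alloc` call failed.

```python
from typing import Any


def try_admit(manager: Any, ctx: Any, replica_idx: int,
              alloc_error: type[Exception]) -> bool:
    """Claim and allocate for one request; undo the claim if alloc fails.

    `manager` is expected to expose claim/alloc/release as documented here.
    `alloc_error` is the exception type alloc() raises when blocks run out
    (InsufficientBlocksError in this API). It is a parameter so that this
    sketch does not guess an import path.
    """
    manager.claim(ctx.request_id, replica_idx=replica_idx)
    try:
        manager.alloc(ctx, replica_idx=replica_idx)
    except alloc_error:
        # Assumption: release() is safe to call for a claimed request
        # whose allocation did not succeed.
        manager.release(ctx.request_id, replica_idx=replica_idx)
        return False
    return True
```

A scheduler could call this helper per request and defer any request for which it returns `False` until blocks are freed.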
alloc_dummy()
alloc_dummy(request_id, replica_idx)
Claims a dummy request and maps it to the replica's null block.
claim()
claim(request_id, replica_idx)
Reserves a sequence ID for the given request ID.
contains()
contains(request_id, replica_idx)
Returns whether the request is present on the given replica.
get_device_buffer()
get_device_buffer(replica_idx)
Returns the replica's KV buffer (single leaf or tree).
HACK: this exists only for the transfer engine; callers flatten via
KVCacheBufferInterface.all_buffers.
-
Parameters:
-
replica_idx (int)
-
Return type:
-
KVCacheBufferInterface
get_metrics_aggregated()
get_metrics_aggregated()
Returns aggregated metrics across all replicas.
-
Return type:
get_num_disk_pages()
get_num_disk_pages(replica_idx)
Returns number of disk pages for the replica.
get_num_host_pages()
get_num_host_pages(replica_idx)
Returns number of host pages for the replica.
get_num_pages()
get_num_pages(replica_idx)
Returns total number of pages for the replica.
get_num_used_disk_pages()
get_num_used_disk_pages(replica_idx)
Returns number of used disk pages for the replica.
get_num_used_host_pages()
get_num_used_host_pages(replica_idx)
Returns number of used host pages for the replica.
get_num_used_pages()
get_num_used_pages(replica_idx)
Returns number of used pages for the replica.
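The six page-count getters pair up naturally as (used, total) per storage tier. The sketch below collects them for one replica and derives a usage fraction. `page_usage` and `used_fraction` are hypothetical helper names, the manager is duck-typed, and the tier labels are illustrative only.

```python
from typing import Any


def page_usage(manager: Any, replica_idx: int) -> dict[str, tuple[int, int]]:
    """Return {tier: (used, total)} page counts for one replica."""
    return {
        "pages": (manager.get_num_used_pages(replica_idx),
                  manager.get_num_pages(replica_idx)),
        "host": (manager.get_num_used_host_pages(replica_idx),
                 manager.get_num_host_pages(replica_idx)),
        "disk": (manager.get_num_used_disk_pages(replica_idx),
                 manager.get_num_disk_pages(replica_idx)),
    }


def used_fraction(used: int, total: int) -> float:
    """Fraction of pages in use; 0.0 when the tier has no pages at all."""
    return used / total if total else 0.0
```

Guarding against a zero total matters because a tier may have no pages configured, for example when `total_num_host_pages` is left at its default of 0.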
get_pct_used_blocks_after_allocation()
get_pct_used_blocks_after_allocation(ctx, replica_idx)
Gets the percentage of blocks used after allocating for a request.
-
Parameters:
-
- ctx (TextContext) – The request context containing sequence information and token indices.
- replica_idx (int) – Index of the replica to query.
-
Returns:
-
The percentage of total blocks used after allocating for the request.
-
Return type:
get_req_blocks()
get_req_blocks(request_id, replica_idx)
Returns block IDs for the request on the given replica.
num_free_blocks()
num_free_blocks(replica_idx=0)
Returns the number of free KV cache blocks on the given replica.
release()
release(request_id, replica_idx)
Releases blocks for the request on the given replica.
reserve()
reserve(replica_batches)
Claims, allocates, and releases contexts within a scope.
This helper is for ephemeral flows (for example, warmup capture) where request IDs should be released when leaving the scope.
-
Parameters:
-
replica_batches (Sequence[Sequence[TextContext]]) – Per-replica lists of contexts to reserve.
-
Return type:
-
Iterator[None]
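The `Iterator[None]` return type is consistent with a generator-based context manager, so `reserve()` is presumably used in a `with` statement. The sketch below shows how a scoped helper of this kind could be built from `claim`, `alloc`, and `release`. It is an illustration under that assumption, not the library's implementation: `reserve_sketch` is a hypothetical name, and it assumes the outer index of `replica_batches` is the replica index.

```python
from contextlib import contextmanager
from typing import Any, Iterator, Sequence


@contextmanager
def reserve_sketch(manager: Any,
                   replica_batches: Sequence[Sequence[Any]]) -> Iterator[None]:
    """Claim and allocate every context on entry; release them on exit."""
    claimed: list[tuple[Any, int]] = []
    try:
        for replica_idx, batch in enumerate(replica_batches):
            for ctx in batch:
                manager.claim(ctx.request_id, replica_idx=replica_idx)
                claimed.append((ctx.request_id, replica_idx))
                manager.alloc(ctx, replica_idx=replica_idx)
        yield
    finally:
        # Release everything that was claimed, even if setup or the
        # body of the with-block raised.
        for request_id, replica_idx in claimed:
            manager.release(request_id, replica_idx=replica_idx)
```

Recording each claim before calling `alloc` means a failure partway through setup still releases every request claimed so far.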
reset_metrics()
reset_metrics()
Resets metrics for the block manager.
-
Return type:
-
None
reset_prefix_cache()
reset_prefix_cache()
Resets the device prefix caches and every connector's tiers.
-
Return type:
-
None
runtime_inputs()
runtime_inputs(batches, *, max_cache_length=None, batch_characteristics=None)
Gets the graph inputs for per-replica batches of requests.
Returns a single KVCacheInputs leaf (or a MultiKVCacheInputs tree for multi-cache models) whose leaves hold the inputs for every (DP replica, TP shard) device.
Raises a RuntimeError if any request does not already have enough blocks allocated to it.
-
Parameters:
-
- batches (Sequence[Sequence[TextContext]]) – Per-replica batches of requests.
- max_cache_length (int | None) – Optional explicit max cache length used to size the lookup table (LUT) views. If not provided, the length is derived from the requests at runtime.
- batch_characteristics (BatchCharacteristics | None) – Optional upper-bound batch shape applied uniformly across every replica when preparing attention dispatch metadata. When provided (for example, during graph-capture replay, where every DP replica must run the identical captured graph), the dispatch key is resolved once from these aligned values; the real per-replica values must not exceed them. When None, each replica prepares metadata from its own real values, which may differ per replica.
-
Return type:
runtime_inputs_for_leaf()
runtime_inputs_for_leaf(batches, *, max_cache_length=None, batch_characteristics=None)
Returns runtime_inputs() narrowed to a single leaf cache.
Convenience wrapper for single-cache (non-tree) models: it asserts that the result is a KVCacheInputs leaf and returns it, so callers can access .inputs directly without narrowing the KVCacheInputsInterface themselves. Raises AssertionError for tree (MultiKVCacheInputs) models.
-
Parameters:
-
- batches (Sequence[Sequence[TextContext]])
- max_cache_length (int | None)
- batch_characteristics (BatchCharacteristics | None)
-
Return type:
step()
step(batches)
Commits new tokens into the prefix cache for per-replica batches.
-
Parameters:
-
batches (Sequence[Sequence[TextContext]])
-
Return type:
-
None
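Both `runtime_inputs()` and `step()` take `batches` as a sequence of per-replica sequences. Assuming the outer index is the replica index, a flat list of (context, replica) assignments can be grouped into that shape as sketched below. `group_by_replica` is a hypothetical helper, not part of the API.

```python
from typing import Any, Iterable


def group_by_replica(assignments: Iterable[tuple[Any, int]],
                     num_replicas: int) -> list[list[Any]]:
    """Build per-replica batches from (context, replica_idx) pairs.

    The result has exactly `num_replicas` inner lists; replicas with no
    requests get an empty list so that list positions stay aligned with
    replica indices.
    """
    batches: list[list[Any]] = [[] for _ in range(num_replicas)]
    for ctx, replica_idx in assignments:
        if not 0 <= replica_idx < num_replicas:
            raise ValueError(f"replica_idx {replica_idx} out of range")
        batches[replica_idx].append(ctx)
    return batches
```

With two replicas and one request on each, this yields a two-element outer list with one context per inner list.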
total_num_blocks()
total_num_blocks(replica_idx=0)
Returns the total number of KV cache blocks on the given replica.
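One possible use of the per-replica block counts is choosing where to place a new request. The sketch below picks the replica with the most free blocks. `least_loaded_replica` is a hypothetical helper shown for illustration only; this page does not describe how the library or its scheduler actually assigns requests to replicas.

```python
from typing import Any


def least_loaded_replica(manager: Any, num_replicas: int) -> int:
    """Return the index of the replica with the most free KV cache blocks.

    Ties resolve to the lowest replica index, because max() returns the
    first maximal element it encounters.
    """
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    return max(range(num_replicas), key=manager.num_free_blocks)
```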