Python class

PagedKVCacheManager

class max.pipelines.kv_cache.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False, *, max_batch_size)

Bases: object

Paged KVCache manager with data and tensor parallelism support.

# Assumes kv_manager is a constructed PagedKVCacheManager and that
# ctx1 and ctx2 are TextContext objects.

# Reserve a slot for each request on its replica
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)

# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0)
kv_manager.alloc(ctx2, replica_idx=1)

# Get KVCache inputs to feed to graph (one batch per replica)
kv_cache_inputs = kv_manager.runtime_inputs([[ctx1], [ctx2]])

# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)

# Commit newly written blocks to prefix cache (one batch per replica)
kv_manager.step([[ctx1], [ctx2]])

# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)

Initialize the multi-device paged KV cache manager.

Parameters:

  • params (KVCacheParamInterface) – KV cache parameters. Pass MultiKVCacheParams for models with more than one KV cache.
  • session (InferenceSession) – The MAX Engine inference session.
  • total_num_pages (int) – The total number of pages to allocate.
  • total_num_host_pages (int) – The total number of host pages to allocate. Defaults to 0.
  • max_batch_size (int) – Maximum runtime batch size (keyword-only). Used to preallocate, per replica, the row capacity of the runtime lookup table and cache lengths.
  • enable_runtime_checks (bool) – Whether to enable runtime checks. Defaults to False.
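As a rough guide to choosing total_num_pages, the page count can be estimated from a memory budget. The sketch below uses the usual KV cache size arithmetic. The function name and every argument (including page_size and dtype_bytes) are assumptions for illustration, not part of this API, and the manager's own sizing logic may differ.

```python
def estimate_num_pages(
    budget_bytes: int,
    *,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    page_size: int,
    dtype_bytes: int,
) -> int:
    """Estimate how many KV pages fit in a memory budget (illustrative only)."""
    # One page holds keys and values (factor of 2) for `page_size` tokens
    # across every layer and every KV head.
    bytes_per_page = (
        2 * num_layers * num_kv_heads * head_dim * page_size * dtype_bytes
    )
    # Floor division: a partial page cannot be allocated.
    return budget_bytes // bytes_per_page


# Hypothetical model shape and an 8 GiB budget.
pages = estimate_num_pages(
    8 * 1024**3,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    page_size=128,
    dtype_bytes=2,
)
```

With these hypothetical numbers each page takes 16 MiB, so the budget holds 512 pages.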

alloc()

alloc(data, replica_idx)

Allocates blocks for a request.

When prefix caching is enabled, some of the allocated blocks may be retrieved from the prefix cache, and the context’s active token window is advanced accordingly.

Parameters:

  • data (TextContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim.
  • replica_idx (int) – Index of the replica to allocate on.

Return type:

None
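To make the effect of a prefix-cache hit concrete, the following sketch counts how many blocks would still need a fresh allocation. It is only block arithmetic under stated assumptions: the function name, page_size, and cached_blocks are hypothetical, and the manager's real accounting is not documented here.

```python
def new_blocks_needed(num_tokens: int, page_size: int, cached_blocks: int = 0) -> int:
    """Blocks that must be freshly allocated for a request (illustrative only)."""
    # Ceiling division: a partially filled page still occupies a whole block.
    total_blocks = -(-num_tokens // page_size)
    # Blocks served by a prefix-cache hit need no new allocation.
    return max(total_blocks - cached_blocks, 0)


# A 300-token request with 128-token pages, without and with a cache hit.
cold = new_blocks_needed(300, page_size=128)
warm = new_blocks_needed(300, page_size=128, cached_blocks=2)
```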

alloc_dummy()

alloc_dummy(request_id, replica_idx)

Claims a dummy request and maps it to the replica’s null block.

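Parameters:

  • request_id – ID of the dummy request to claim.
  • replica_idx (int) – Index of the replica whose null block is used.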

Return type:

None

claim()

claim(request_id, replica_idx)

Reserves a sequence ID for the given request ID.

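Parameters:

  • request_id – ID of the request to reserve a sequence ID for.
  • replica_idx (int) – Index of the replica to claim on.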

Return type:

None

contains()

contains(request_id, replica_idx)

Returns whether the request is present on the given replica.

Parameters:

  • request_id – ID of the request to look up.
  • replica_idx (int) – Index of the replica to check.

Return type:

bool

get_device_buffer()

get_device_buffer(replica_idx)

Returns the replica’s KV buffer (single leaf or tree).

HACK: this exists only for the transfer engine; callers flatten via KVCacheBufferInterface.all_buffers.

Parameters:

replica_idx (int)

Return type:

KVCacheBufferInterface

get_metrics_aggregated()

get_metrics_aggregated()

Returns aggregated metrics across all replicas.

Return type:

KVCacheMetrics

get_num_disk_pages()

get_num_disk_pages(replica_idx)

Returns the number of disk pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_host_pages()

get_num_host_pages(replica_idx)

Returns the number of host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_pages()

get_num_pages(replica_idx)

Returns the total number of pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_disk_pages()

get_num_used_disk_pages(replica_idx)

Returns the number of used disk pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_host_pages()

get_num_used_host_pages(replica_idx)

Returns the number of used host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_pages()

get_num_used_pages(replica_idx)

Returns the number of used pages for the replica.

Parameters:

replica_idx (int)

Return type:

int
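The page-count getters above can be combined into a per-tier utilization figure. This is a minimal sketch: tier_utilization is a hypothetical helper, and FakePageCounts is a stand-in that only mimics the documented getter names so the example runs without a real manager.

```python
class FakePageCounts:
    """Stand-in exposing the page-count getters documented above."""

    def __init__(self, device: tuple[int, int], host: tuple[int, int]) -> None:
        self._device = device  # (used, total)
        self._host = host  # (used, total)

    def get_num_used_pages(self, replica_idx: int) -> int:
        return self._device[0]

    def get_num_pages(self, replica_idx: int) -> int:
        return self._device[1]

    def get_num_used_host_pages(self, replica_idx: int) -> int:
        return self._host[0]

    def get_num_host_pages(self, replica_idx: int) -> int:
        return self._host[1]


def tier_utilization(manager, replica_idx: int) -> dict[str, float]:
    """Return the used/total fraction for the device and host tiers."""

    def ratio(used: int, total: int) -> float:
        # A tier with zero pages (for example, no host pages) reports 0.0.
        return used / total if total else 0.0

    return {
        "device": ratio(
            manager.get_num_used_pages(replica_idx),
            manager.get_num_pages(replica_idx),
        ),
        "host": ratio(
            manager.get_num_used_host_pages(replica_idx),
            manager.get_num_host_pages(replica_idx),
        ),
    }


usage = tier_utilization(FakePageCounts(device=(25, 100), host=(0, 0)), replica_idx=0)
```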

get_pct_used_blocks_after_allocation()

get_pct_used_blocks_after_allocation(ctx, replica_idx)

Gets the percentage of blocks used after allocating for a request.

Parameters:

  • ctx (TextContext) – The request context containing sequence information and token indices.
  • replica_idx (int) – Index of the replica to query.

Returns:

The percentage of total blocks used after allocating for the request.

Return type:

float
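One plausible use of this value is an admission check before allocating. The sketch below is an assumption about usage, not a documented pattern: can_admit and its limit argument are hypothetical, FakeManager stands in for the real class, and limit must be expressed in whatever unit the method returns (a 0 to 1 fraction is assumed here).

```python
class FakeManager:
    """Stand-in that returns a fixed projected usage value."""

    def __init__(self, projected: float) -> None:
        self._projected = projected

    def get_pct_used_blocks_after_allocation(self, ctx, replica_idx: int) -> float:
        return self._projected


def can_admit(manager, ctx, replica_idx: int, limit: float) -> bool:
    """Admit the request only if projected block usage stays within `limit`."""
    projected = manager.get_pct_used_blocks_after_allocation(ctx, replica_idx)
    return projected <= limit


# ctx is unused by the stand-in, so None is passed for brevity.
admitted = can_admit(FakeManager(0.80), ctx=None, replica_idx=0, limit=0.95)
rejected = can_admit(FakeManager(0.99), ctx=None, replica_idx=0, limit=0.95)
```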

get_req_blocks()

get_req_blocks(request_id, replica_idx)

Returns block IDs for the request on the given replica.

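Parameters:

  • request_id – ID of the request whose block IDs are returned.
  • replica_idx (int) – Index of the replica to query.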

Return type:

list[int]

num_free_blocks()

num_free_blocks(replica_idx=0)

Returns the number of free KV cache blocks on the given replica.

Parameters:

replica_idx (int)

Return type:

int
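Because the count is reported per replica, a caller could use it to choose where to place a new request. The following is a sketch of one such policy, not something this class does itself: pick_replica and num_replicas are hypothetical, and FakeManager only mimics num_free_blocks.

```python
class FakeManager:
    """Stand-in exposing num_free_blocks with the documented signature."""

    def __init__(self, free_per_replica: list[int]) -> None:
        self._free = free_per_replica

    def num_free_blocks(self, replica_idx: int = 0) -> int:
        return self._free[replica_idx]


def pick_replica(manager, num_replicas: int) -> int:
    """Return the index of the replica with the most free blocks."""
    # max() returns the first maximal element, so ties go to the lowest index.
    return max(
        range(num_replicas),
        key=lambda idx: manager.num_free_blocks(replica_idx=idx),
    )


choice = pick_replica(FakeManager([3, 10, 10]), num_replicas=3)
```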

release()

release(request_id, replica_idx)

Releases blocks for the request on the given replica.

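Parameters:

  • request_id – ID of the request to release.
  • replica_idx (int) – Index of the replica that holds the request.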

Return type:

None
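This page does not say what happens when release is called for a request that is not present, so a defensive caller might check contains first. The sketch below shows that guard. release_if_present is a hypothetical helper and FakeManager is a stand-in that only tracks (request ID, replica) pairs.

```python
class FakeManager:
    """Stand-in that tracks claimed (request_id, replica_idx) pairs."""

    def __init__(self) -> None:
        self._live: set[tuple[str, int]] = set()

    def claim(self, request_id: str, replica_idx: int) -> None:
        self._live.add((request_id, replica_idx))

    def contains(self, request_id: str, replica_idx: int) -> bool:
        return (request_id, replica_idx) in self._live

    def release(self, request_id: str, replica_idx: int) -> None:
        self._live.remove((request_id, replica_idx))


def release_if_present(manager, request_id: str, replica_idx: int) -> bool:
    """Release the request if the replica holds it; report whether it did."""
    if manager.contains(request_id, replica_idx):
        manager.release(request_id, replica_idx)
        return True
    return False


manager = FakeManager()
manager.claim("req-1", 0)
first = release_if_present(manager, "req-1", 0)
second = release_if_present(manager, "req-1", 0)  # already released
```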

reserve()

reserve(replica_batches)

Claims, allocates, and releases contexts within a scope.

This helper is for ephemeral flows (for example, warmup capture) where request IDs should be released when leaving the scope.

Parameters:

replica_batches (Sequence[Sequence[TextContext]]) – Per-replica lists of contexts to reserve.

Return type:

Iterator[None]
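The Iterator[None] return type suggests a generator-based context manager. The sketch below shows that claim, allocate, release-on-exit pattern with toy stand-ins. ToyManager, ToyContext, and reserve_sketch are hypothetical names, and the real method may differ in details such as error handling.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class ToyContext:
    request_id: str


class ToyManager:
    """Stand-in that records which (request_id, replica_idx) pairs are live."""

    def __init__(self) -> None:
        self.live: set[tuple[str, int]] = set()

    def claim(self, request_id: str, replica_idx: int) -> None:
        self.live.add((request_id, replica_idx))

    def alloc(self, ctx: ToyContext, replica_idx: int) -> None:
        # Mirrors the documented rule that a request is claimed before alloc.
        assert (ctx.request_id, replica_idx) in self.live

    def release(self, request_id: str, replica_idx: int) -> None:
        self.live.discard((request_id, replica_idx))


@contextmanager
def reserve_sketch(
    manager: ToyManager, replica_batches: Sequence[Sequence[ToyContext]]
) -> Iterator[None]:
    claimed: list[tuple[str, int]] = []
    try:
        # The outer index is the replica; the inner list is that replica's batch.
        for replica_idx, batch in enumerate(replica_batches):
            for ctx in batch:
                manager.claim(ctx.request_id, replica_idx)
                claimed.append((ctx.request_id, replica_idx))
                manager.alloc(ctx, replica_idx)
        yield
    finally:
        # Release everything claimed so far, even if the body raised.
        for request_id, replica_idx in claimed:
            manager.release(request_id, replica_idx)


manager = ToyManager()
with reserve_sketch(manager, [[ToyContext("a")], [ToyContext("b")]]):
    inside = set(manager.live)
after = set(manager.live)
```

Inside the with block both requests are live on their replicas. After the block exits, nothing remains claimed.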

reset_metrics()

reset_metrics()

Resets metrics for the block manager.

Return type:

None

reset_prefix_cache()

reset_prefix_cache()

Resets the device prefix caches and every connector’s tiers.

Return type:

None

runtime_inputs()

runtime_inputs(batches, *, max_cache_length=None, batch_characteristics=None)

Gets the graph inputs for per-replica batches of requests.

Returns a single KVCacheInputs leaf (or MultiKVCacheInputs tree for multi-cache models) whose leaves hold every (DP replica, TP shard) device’s inputs.

Raises a RuntimeError if any request does not already have enough blocks allocated to it.

Parameters:

  • batches (Sequence[Sequence[TextContext]]) – Per-replica batches of requests.
  • max_cache_length (int | None) – Optional explicit maximum cache length used to size the lookup-table (LUT) views. If not provided, the length is derived from the requests at runtime.
  • batch_characteristics (BatchCharacteristics | None) – Optional upper-bound batch shape applied uniformly across every replica when preparing attention dispatch metadata. When provided (e.g. graph-capture replay, where every DP replica must run the identical captured graph), the dispatch key is resolved once from these aligned values; the real per-replica values must not exceed them. When None, each replica prepares metadata from its own real values (which may differ per replica).

Return type:

KVCacheInputsInterface[Buffer, Buffer]
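Since batches is indexed by replica, a caller that tracks a flat list of (context, replica index) pairs has to regroup it first. The sketch below shows one way to build that nested shape. group_by_replica is a hypothetical helper, and plain strings stand in for TextContext objects.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def group_by_replica(
    assignments: Iterable[tuple[T, int]], num_replicas: int
) -> list[list[T]]:
    """Group (context, replica_idx) pairs into one batch per replica."""
    # Every replica gets a slot, so idle replicas end up with an empty batch.
    batches: list[list[T]] = [[] for _ in range(num_replicas)]
    for ctx, replica_idx in assignments:
        batches[replica_idx].append(ctx)
    return batches


batches = group_by_replica(
    [("ctx1", 0), ("ctx2", 1), ("ctx3", 0)], num_replicas=3
)
```

Here replica 0 receives ctx1 and ctx3, replica 1 receives ctx2, and replica 2 gets an empty batch.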

runtime_inputs_for_leaf()

runtime_inputs_for_leaf(batches, *, max_cache_length=None, batch_characteristics=None)

Returns runtime_inputs() narrowed to a single leaf cache.

Convenience wrapper for single-cache (non-tree) models: it asserts the result is a KVCacheInputs leaf and returns it, so callers can access .inputs directly without narrowing the KVCacheInputsInterface themselves. Raises AssertionError for tree (MultiKVCacheInputs) models.

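Parameters:

  • batches (Sequence[Sequence[TextContext]]) – Per-replica batches of requests, as for runtime_inputs().
  • max_cache_length (int | None) – As for runtime_inputs().
  • batch_characteristics (BatchCharacteristics | None) – As for runtime_inputs().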

Return type:

KVCacheInputs[Buffer, Buffer]
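The narrowing described above follows a common assert-and-return pattern. The sketch below illustrates it with stand-in classes: LeafInputs and TreeInputs are hypothetical substitutes for KVCacheInputs and MultiKVCacheInputs, and narrow_to_leaf is not the library's code.

```python
class LeafInputs:
    """Stand-in for a single-cache result that exposes .inputs."""

    def __init__(self, inputs: list) -> None:
        self.inputs = inputs


class TreeInputs:
    """Stand-in for a multi-cache (tree) result."""

    def __init__(self, children: list) -> None:
        self.children = children


def narrow_to_leaf(result: object) -> LeafInputs:
    """Return the result unchanged if it is a leaf, else fail loudly."""
    # Note: assert statements are skipped when Python runs with -O.
    assert isinstance(result, LeafInputs), "expected a single-cache (leaf) result"
    return result


leaf = narrow_to_leaf(LeafInputs(inputs=[1, 2]))

try:
    narrow_to_leaf(TreeInputs(children=[]))
    tree_rejected = False
except AssertionError:
    tree_rejected = True
```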

step()

step(batches)

Commits new tokens into the prefix cache for per-replica batches.

Parameters:

batches (Sequence[Sequence[TextContext]])

Return type:

None

total_num_blocks()

total_num_blocks(replica_idx=0)

Returns the total number of KV cache blocks on the given replica.

Parameters:

replica_idx (int)

Return type:

int