Python class

PagedKVCacheManager

`PagedKVCacheManager`

class max.pipelines.kv_cache.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False, *, max_batch_size)

Bases: object

Paged KVCache manager with data and tensor parallelism support.

# Reserve a sequence ID for each request on its replica
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)

# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0)
kv_manager.alloc(ctx2, replica_idx=1)

# Get KVCache inputs to feed to graph
kv_cache_inputs = kv_manager.runtime_inputs([[ctx1], [ctx2]])

# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)

# Commit newly written blocks to prefix cache
kv_manager.step([[ctx1], [ctx2]])

# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)

Initialize the multi-device paged KV cache manager.

Parameters:

params (KVCacheParamInterface) – KV cache parameters. Pass MultiKVCacheParams for models with more than one KV cache.
session (InferenceSession) – The MAX Engine inference session.
total_num_pages (int) – The total number of pages to allocate.
total_num_host_pages (int) – The total number of host pages to allocate.
max_batch_size (int) – Maximum runtime batch size. Used to preallocate, per replica, the row capacity for the runtime lookup table and cache lengths.
enable_runtime_checks (bool) – Whether to enable runtime checks.

`alloc()`

alloc(data, replica_idx)

Allocates blocks for a request.

When prefix caching is enabled, some of the allocated blocks may be retrieved from the prefix cache, and the context’s active token window is advanced accordingly.

Parameters:

data (TextContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim.
replica_idx (int) – Index of the replica to allocate on.

Raises:

InsufficientBlocksError – If there are insufficient free blocks to satisfy the allocation.

Return type:

None
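
A caller usually has to decide what happens when `alloc()` cannot find enough blocks. The sketch below shows one possible pattern: claim, try to allocate, and undo the claim on failure. `ToyManager`, `ToyContext`, and the local `InsufficientBlocksError` class are hypothetical stand-ins so the snippet runs on its own; the real exception's import path is not given on this page, and releasing after a failed allocation is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field


class InsufficientBlocksError(Exception):
    """Stand-in for the real exception (its import path is not shown on this page)."""


@dataclass
class ToyContext:
    """Hypothetical stand-in for TextContext."""
    request_id: str
    blocks_needed: int


@dataclass
class ToyManager:
    """Hypothetical single-replica stand-in exposing claim/alloc/release/contains."""
    free_blocks: int
    claimed: dict = field(default_factory=dict)

    def claim(self, request_id, replica_idx):
        self.claimed[request_id] = 0

    def alloc(self, data, replica_idx):
        if data.blocks_needed > self.free_blocks:
            raise InsufficientBlocksError(data.request_id)
        self.free_blocks -= data.blocks_needed
        self.claimed[data.request_id] = data.blocks_needed

    def release(self, request_id, replica_idx):
        self.free_blocks += self.claimed.pop(request_id)

    def contains(self, request_id, replica_idx):
        return request_id in self.claimed


def try_admit(manager, ctx, replica_idx, error_type=InsufficientBlocksError):
    """Claim and allocate; undo the claim and return False if blocks run out."""
    manager.claim(ctx.request_id, replica_idx)
    try:
        manager.alloc(ctx, replica_idx)
    except error_type:
        manager.release(ctx.request_id, replica_idx)
        return False
    return True


manager = ToyManager(free_blocks=4)
admitted_small = try_admit(manager, ToyContext("req-a", 3), replica_idx=0)
admitted_large = try_admit(manager, ToyContext("req-b", 2), replica_idx=0)
```

Passing the exception type as a parameter keeps the helper usable with whichever `InsufficientBlocksError` class the installed library actually exports.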

`alloc_dummy()`

alloc_dummy(request_id, replica_idx)

Claims a dummy request and maps it to the replica’s null block.

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

None

`claim()`

claim(request_id, replica_idx)

Reserves a sequence ID for the given request ID on the given replica.

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

None

`contains()`

contains(request_id, replica_idx)

Returns whether the request is present on the given replica.

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

bool
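
One use for `contains()` is to make cleanup safe to call more than once. The following is a minimal sketch under that assumption; `ToyManager` is a hypothetical stand-in, and whether the real `release()` raises for an unknown request ID is not stated on this page.

```python
class ToyManager:
    """Hypothetical stand-in that tracks which request IDs each replica holds."""

    def __init__(self, num_replicas):
        self._requests = [set() for _ in range(num_replicas)]

    def claim(self, request_id, replica_idx):
        self._requests[replica_idx].add(request_id)

    def contains(self, request_id, replica_idx):
        return request_id in self._requests[replica_idx]

    def release(self, request_id, replica_idx):
        # The toy raises KeyError on unknown IDs; the real behavior is not documented here.
        self._requests[replica_idx].remove(request_id)


def release_if_present(manager, request_id, replica_idx):
    """Release only when the replica actually holds the request."""
    if manager.contains(request_id, replica_idx):
        manager.release(request_id, replica_idx)
        return True
    return False


manager = ToyManager(num_replicas=2)
manager.claim("req-a", replica_idx=1)
first = release_if_present(manager, "req-a", replica_idx=1)
second = release_if_present(manager, "req-a", replica_idx=1)
wrong_replica = release_if_present(manager, "req-a", replica_idx=0)
```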

`get_device_buffer()`

get_device_buffer(replica_idx)

Returns the replica’s KV buffer (single leaf or tree).

HACK: this exists only for the transfer engine; callers flatten via KVCacheBufferInterface.all_buffers.

Parameters:

replica_idx (int)

Return type:

KVCacheBufferInterface

`get_metrics_aggregated()`

get_metrics_aggregated()

Returns aggregated metrics across all replicas.

Return type:

KVCacheMetrics

`get_num_disk_pages()`

get_num_disk_pages(replica_idx)

Returns the number of disk pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

`get_num_host_pages()`

get_num_host_pages(replica_idx)

Returns the number of host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

`get_num_pages()`

get_num_pages(replica_idx)

Returns the total number of pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

`get_num_used_disk_pages()`

get_num_used_disk_pages(replica_idx)

Returns the number of used disk pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

`get_num_used_host_pages()`

get_num_used_host_pages(replica_idx)

Returns the number of used host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

`get_num_used_pages()`

get_num_used_pages(replica_idx)

Returns the number of used pages for the replica.

Parameters:

replica_idx (int)

Return type:

int
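
The six page-count getters pair up naturally as used/total. As a rough sketch, they could be combined into a per-replica usage report like the one below. `StubManager` and `page_usage` are hypothetical names introduced for illustration, and the fixed counts are made-up values, not measurements.

```python
class StubManager:
    """Hypothetical stand-in that returns fixed page counts for any replica."""

    def get_num_pages(self, replica_idx):
        return 100

    def get_num_used_pages(self, replica_idx):
        return 25

    def get_num_host_pages(self, replica_idx):
        return 40

    def get_num_used_host_pages(self, replica_idx):
        return 10

    def get_num_disk_pages(self, replica_idx):
        return 0

    def get_num_used_disk_pages(self, replica_idx):
        return 0


def page_usage(manager, replica_idx):
    """Collect used/total page counts for one replica, keyed by getter family."""
    getters = {
        "pages": (manager.get_num_used_pages, manager.get_num_pages),
        "host_pages": (manager.get_num_used_host_pages, manager.get_num_host_pages),
        "disk_pages": (manager.get_num_used_disk_pages, manager.get_num_disk_pages),
    }
    report = {}
    for name, (used_fn, total_fn) in getters.items():
        used, total = used_fn(replica_idx), total_fn(replica_idx)
        # A tier with zero pages is reported as 0.0 rather than dividing by zero.
        fraction = used / total if total else 0.0
        report[name] = {"used": used, "total": total, "fraction": fraction}
    return report


usage = page_usage(StubManager(), replica_idx=0)
```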

`get_pct_used_blocks_after_allocation()`

get_pct_used_blocks_after_allocation(ctx, replica_idx)

Gets the percentage of blocks used after allocating for a request.

Parameters:

ctx (TextContext) – The request context containing sequence information and token indices.
replica_idx (int) – Index of the replica to query.

Returns:

The percentage of total blocks used after allocating for the request.

Return type:

float
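
A scheduler might use this value to keep headroom in the cache before admitting a request. The sketch below assumes the return value is a fraction in the range 0 to 1; this page does not state the scale, so check it before choosing a threshold. `StubManager`, `fits_under_watermark`, and the `blocks_needed` attribute are hypothetical and exist only so the example runs without the library.

```python
from types import SimpleNamespace


class StubManager:
    """Hypothetical stand-in with simple used/total block arithmetic."""

    def __init__(self, total_blocks, used_blocks):
        self.total_blocks = total_blocks
        self.used_blocks = used_blocks

    def get_pct_used_blocks_after_allocation(self, ctx, replica_idx):
        # Toy arithmetic returning a fraction in [0, 1] (assumed scale).
        projected = self.used_blocks + ctx.blocks_needed
        return min(1.0, projected / self.total_blocks)


def fits_under_watermark(manager, ctx, replica_idx, watermark=0.9):
    """Return True if admitting ctx keeps projected usage at or below the watermark."""
    projected = manager.get_pct_used_blocks_after_allocation(ctx, replica_idx)
    return projected <= watermark


manager = StubManager(total_blocks=10, used_blocks=6)
small_fits = fits_under_watermark(manager, SimpleNamespace(blocks_needed=2), replica_idx=0)
large_fits = fits_under_watermark(manager, SimpleNamespace(blocks_needed=4), replica_idx=0)
```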

`get_req_blocks()`

get_req_blocks(request_id, replica_idx)

Returns block IDs for the request on the given replica.

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

list[int]

`num_free_blocks()`

num_free_blocks(replica_idx=0)

Returns the number of free KV cache blocks on the given replica.

Parameters:

replica_idx (int)

Return type:

int

`release()`

release(request_id, replica_idx)

Releases blocks for the request on the given replica.

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

None

`reserve()`

reserve(replica_batches)

Claims, allocates, and releases contexts within a scope.

This helper is for ephemeral flows (for example, warmup capture) where request IDs should be released when leaving the scope.

Parameters:

replica_batches (Sequence[Sequence[TextContext]]) – Per-replica lists of contexts to reserve.

Return type:

Iterator[None]
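
The claim, allocate, and release-on-exit behavior described above can be pictured as a context manager. The following is an illustrative sketch of that pattern, not the library's implementation: `reserve_scope` and `RecordingManager` are hypothetical, and the release order is an arbitrary choice made for the example.

```python
from contextlib import contextmanager
from types import SimpleNamespace


class RecordingManager:
    """Hypothetical stand-in that logs each call instead of managing memory."""

    def __init__(self):
        self.events = []

    def claim(self, request_id, replica_idx):
        self.events.append(("claim", request_id, replica_idx))

    def alloc(self, data, replica_idx):
        self.events.append(("alloc", data.request_id, replica_idx))

    def release(self, request_id, replica_idx):
        self.events.append(("release", request_id, replica_idx))


@contextmanager
def reserve_scope(manager, replica_batches):
    """Claim and allocate every context, then release them all on exit."""
    claimed = []
    try:
        for replica_idx, batch in enumerate(replica_batches):
            for ctx in batch:
                manager.claim(ctx.request_id, replica_idx)
                claimed.append((ctx.request_id, replica_idx))
                manager.alloc(ctx, replica_idx)
        yield
    finally:
        # Release in reverse order, even if the body or an alloc raised.
        for request_id, replica_idx in reversed(claimed):
            manager.release(request_id, replica_idx)


manager = RecordingManager()
batches = [[SimpleNamespace(request_id="a")], [SimpleNamespace(request_id="b")]]
with reserve_scope(manager, batches):
    manager.events.append(("body",))
```

Because the releases sit in a `finally` block, the request IDs are freed even when the code inside the `with` body raises, which is the property an ephemeral flow such as warmup capture needs.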

`reset_metrics()`

reset_metrics()

Resets metrics for the block manager.

Return type:

None

`reset_prefix_cache()`

reset_prefix_cache()

Resets the device prefix caches and every connector’s tiers.

Return type:

None

`runtime_inputs()`

runtime_inputs(batches, *, max_cache_length=None, batch_characteristics=None)

Gets the graph inputs for per-replica batches of requests.

Returns a single KVCacheInputs leaf (or MultiKVCacheInputs tree for multi-cache models) whose leaves hold every (DP replica, TP shard) device’s inputs.

Raises a RuntimeError if any request does not already have enough blocks allocated to it.

Parameters:

batches (Sequence[Sequence[TextContext]]) – Per-replica batches of requests.
max_cache_length (int | None) – Optional explicit max cache length to size LUT views. If not provided, uses request-derived runtime length.
batch_characteristics (BatchCharacteristics | None) – Optional upper-bound batch shape applied uniformly across every replica when preparing attention dispatch metadata. When provided (e.g. graph-capture replay, where every DP replica must run the identical captured graph), the dispatch key is resolved once from these aligned values; the real per-replica values must not exceed them. When None, each replica prepares metadata from its own real values (which may differ per replica).

Return type:

KVCacheInputsInterface[Buffer, Buffer]
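
Since `batches` is a sequence of per-replica batches, callers need to group their contexts by replica before calling this method. The helper below is a small sketch of that grouping. It assumes the outer index of `batches` is the replica index, which this page implies but does not spell out, and `build_replica_batches` is a hypothetical name. Plain strings stand in for `TextContext` objects.

```python
def build_replica_batches(assignments, num_replicas):
    """Group (context, replica_idx) pairs into one list per replica.

    Replicas with no requests get an empty list so that the outer index
    always matches the replica index (an assumption about the expected shape).
    """
    batches = [[] for _ in range(num_replicas)]
    for ctx, replica_idx in assignments:
        batches[replica_idx].append(ctx)
    return batches


assignments = [("ctx1", 0), ("ctx2", 1), ("ctx3", 0)]
batches = build_replica_batches(assignments, num_replicas=3)
```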

`runtime_inputs_for_leaf()`

runtime_inputs_for_leaf(batches, *, max_cache_length=None, batch_characteristics=None)

Returns runtime_inputs() narrowed to a single leaf cache.

Convenience wrapper for single-cache (non-tree) models: it asserts the result is a KVCacheInputs leaf and returns it, so callers can access .inputs directly without narrowing the KVCacheInputsInterface themselves. Raises AssertionError for tree (MultiKVCacheInputs) models.

Parameters:

batches (Sequence[Sequence[TextContext]])
max_cache_length (int | None)
batch_characteristics (BatchCharacteristics | None)

Return type:

KVCacheInputs[Buffer, Buffer]
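
The narrowing this wrapper performs is the usual assert-and-return idiom. A minimal sketch with hypothetical stand-in classes (`InputsInterface`, `LeafInputs`, `TreeInputs`) rather than the real `KVCacheInputs` types:

```python
class InputsInterface:
    """Hypothetical stand-in for the common interface."""


class LeafInputs(InputsInterface):
    """Hypothetical single-cache result that exposes .inputs directly."""

    def __init__(self, inputs):
        self.inputs = inputs


class TreeInputs(InputsInterface):
    """Hypothetical multi-cache result that holds several leaves."""

    def __init__(self, children):
        self.children = children


def narrow_to_leaf(result):
    """Return result unchanged if it is a leaf; otherwise fail the assertion."""
    assert isinstance(result, LeafInputs), "expected a single-cache (leaf) result"
    return result


leaf = LeafInputs(inputs=[1, 2, 3])
narrowed = narrow_to_leaf(leaf)

try:
    narrow_to_leaf(TreeInputs(children=[leaf]))
    tree_rejected = False
except AssertionError:
    tree_rejected = True
```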

`step()`

step(batches)

Commits new tokens into the prefix cache for per-replica batches.

Parameters:

batches (Sequence[Sequence[TextContext]])

Return type:

None
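
To give a feel for what committing to a prefix cache means, here is a generic toy model of block-level prefix caching. It is not how this class is implemented: the dictionary keyed by token prefixes, the function names, and the `page_size` parameter are all assumptions chosen for illustration.

```python
def commit_full_blocks(prefix_cache, tokens, block_ids, page_size):
    """Register every completely filled block under the key of its token prefix."""
    num_full = len(tokens) // page_size
    for i in range(num_full):
        key = tuple(tokens[: (i + 1) * page_size])
        # Keep the first block registered for a prefix; later duplicates are ignored.
        prefix_cache.setdefault(key, block_ids[i])
    return num_full


def lookup_prefix(prefix_cache, tokens, page_size):
    """Return block IDs for the longest cached run of full blocks at the start of tokens."""
    hits = []
    for end in range(page_size, len(tokens) + 1, page_size):
        block = prefix_cache.get(tuple(tokens[:end]))
        if block is None:
            break
        hits.append(block)
    return hits


cache = {}
committed = commit_full_blocks(cache, [1, 2, 3, 4, 5], block_ids=[10, 11, 12], page_size=2)
full_hit = lookup_prefix(cache, [1, 2, 3, 4, 9, 9], page_size=2)
partial_hit = lookup_prefix(cache, [1, 2, 7, 7], page_size=2)
miss = lookup_prefix(cache, [8, 8], page_size=2)
```

Only fully written blocks are committed in this model, so the trailing partial block (token 5 above) stays private to its request until it fills up.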

`total_num_blocks()`

total_num_blocks(replica_idx=0)

Returns the total number of KV cache blocks on the given replica.

Parameters:

replica_idx (int)

Return type:

int
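
Together with `num_free_blocks()`, this gives a simple load signal per replica. One way to use it, sketched below, is to pick the replica with the largest free fraction before claiming a new request. `StubManager` and `least_loaded_replica` are hypothetical, and `num_replicas` is passed in explicitly because this page does not show how to query the replica count.

```python
class StubManager:
    """Hypothetical stand-in with fixed per-replica block counts."""

    def __init__(self, free, total):
        self._free, self._total = free, total

    def num_free_blocks(self, replica_idx=0):
        return self._free[replica_idx]

    def total_num_blocks(self, replica_idx=0):
        return self._total[replica_idx]


def least_loaded_replica(manager, num_replicas):
    """Return the replica index with the highest fraction of free blocks."""

    def free_fraction(replica_idx):
        return manager.num_free_blocks(replica_idx) / manager.total_num_blocks(replica_idx)

    return max(range(num_replicas), key=free_fraction)


manager = StubManager(free=[10, 60, 30], total=[100, 100, 100])
target = least_loaded_replica(manager, num_replicas=3)
```

Note that both methods default to `replica_idx=0`, so a multi-replica caller should always pass the index explicitly.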
