Python class

PagedKVCacheManager

class max.pipelines.kv_cache.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False, *, max_batch_size, other_kv_managers_kv_buffers_per_replica=None)

Bases: object

Paged KVCache manager with data and tensor parallelism support.

Example:

# Reserve a sequence ID for each request on its replica
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)

# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0, num_steps=10)
kv_manager.alloc(ctx2, replica_idx=1, num_steps=10)

# Get KVCache inputs to feed to the graph.
# Batches are per replica: the list at index i holds replica i's requests.
kv_cache_inputs = kv_manager.runtime_inputs(
    [[ctx1], [ctx2]], num_steps=10
)

# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)

# Commit newly written blocks to prefix cache
kv_manager.step([[ctx1], [ctx2]])

# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)
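
The batch arguments in the example are nested lists indexed by replica. As a minimal sketch of how such a structure could be assembled, the helper below groups contexts by the replica they were claimed on. `group_by_replica` and `num_replicas` are hypothetical names for illustration and are not part of the MAX API.

```python
def group_by_replica(assignments, num_replicas):
    """Build per-replica batches from (context, replica_idx) pairs.

    The result has one inner list per replica, in replica order, which is
    the Sequence[Sequence[TextContext]] shape the batch arguments expect.
    """
    batches = [[] for _ in range(num_replicas)]
    for ctx, replica_idx in assignments:
        if not 0 <= replica_idx < num_replicas:
            raise IndexError(f"replica_idx {replica_idx} out of range")
        batches[replica_idx].append(ctx)
    return batches


# Strings stand in for TextContext objects so the sketch runs on its own.
batches = group_by_replica([("ctx1", 0), ("ctx2", 1), ("ctx3", 0)], num_replicas=2)
```

A replica with no requests still gets an empty inner list, so the outer index always matches `replica_idx`.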

Initializes the multi-device paged KV cache manager.

Parameters:

  • params (KVCacheParamInterface) – KV cache parameters. Pass MultiKVCacheParams for models with more than one KV cache.
  • session (InferenceSession) – The MAX Engine inference session.
  • total_num_pages (int) – The total number of pages to allocate.
  • total_num_host_pages (int) – The total number of host pages to allocate.
  • max_batch_size (int) – Maximum runtime batch size. Used to preallocate the per-replica row capacity of the runtime lookup table and cache lengths.
  • enable_runtime_checks (bool) – Whether to enable runtime checks.
  • other_kv_managers_kv_buffers_per_replica (list[list[KVCacheBuffer]] | None) – KVCacheBuffers from other KV managers to be co-offloaded by this manager’s KVConnector.

alloc()​

alloc(data, replica_idx, num_steps=1)

Allocates enough blocks for a request to run for num_steps steps.

When prefix caching is enabled, some of the allocated blocks may be retrieved from the prefix cache, and the context’s active token window is advanced accordingly.

Parameters:

  • data (TextContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim.
  • replica_idx (int) – Index of the replica to allocate on.
  • num_steps (int) – The number of steps to reserve blocks for. Default: 1.

Raises:

Return type:

None
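
To make the num_steps reservation concrete, here is a minimal sketch of the block arithmetic a paged allocator typically performs. It is not the MAX implementation: `blocks_needed`, `page_size`, and the assumption that each step appends exactly one token are illustrative only.

```python
def blocks_needed(current_len, num_steps, page_size, allocated):
    """Return how many new blocks a request needs to run num_steps more steps.

    Assumes fixed-size pages and one new token per step. `allocated` counts
    blocks the request already holds, including any reused from a prefix cache.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    total_tokens = current_len + num_steps
    total_blocks = -(-total_tokens // page_size)  # ceiling division
    return max(0, total_blocks - allocated)
```

Under these assumptions, a request that already holds enough blocks needs none, which is why blocks served from the prefix cache reduce the number of fresh blocks taken from the free pool.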

alloc_dummy()​

alloc_dummy(request_id, replica_idx)

Claims a dummy request and maps it to the replica’s null block.

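Parameters:

  • request_id – ID of the dummy request to claim.
  • replica_idx (int) – Index of the replica whose null block is used.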

Return type:

None

cache_params()​

cache_params(cache_idx=0)

Returns the KVCacheParams for a specific cache.

Parameters:

cache_idx (int)

Return type:

KVCacheParams

claim()​

claim(request_id, replica_idx)

Reserves a sequence ID for the given request ID.

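Parameters:

  • request_id – ID of the request to reserve a sequence ID for.
  • replica_idx (int) – Index of the replica to claim on.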

Return type:

None

contains()​

contains(request_id, replica_idx)

Returns whether the request is present on the given replica.

Parameters:

  • request_id – ID of the request to look up.
  • replica_idx (int) – Index of the replica to check.

Return type:

bool

dispatch_resolver()​

dispatch_resolver(replica_idx=0)

Returns the attention dispatch resolver for a replica.

Parameters:

replica_idx (int)

Return type:

AttentionDispatchResolver

get_device_buffer()​

get_device_buffer(replica_idx, cache_idx=0)

Returns the device buffer for a specific cache on a replica.

Parameters:

  • replica_idx (int) – Index of the replica.
  • cache_idx (int) – Index of the cache. Default: 0 (the primary cache).

Return type:

KVCacheBuffer

get_metrics_aggregated()​

get_metrics_aggregated()

Returns aggregated metrics across all replicas.

Return type:

KVCacheMetrics

get_num_disk_pages()​

get_num_disk_pages(replica_idx)

Returns the number of disk pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_host_pages()​

get_num_host_pages(replica_idx)

Returns the number of host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_pages()​

get_num_pages(replica_idx)

Returns the total number of pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_disk_pages()​

get_num_used_disk_pages(replica_idx)

Returns the number of used disk pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_host_pages()​

get_num_used_host_pages(replica_idx)

Returns the number of used host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_pages()​

get_num_used_pages(replica_idx)

Returns the number of used pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_pct_used_blocks_after_allocation()​

get_pct_used_blocks_after_allocation(ctx, replica_idx, num_steps=1)

Gets the percentage of blocks used after allocating for a request.

Parameters:

  • ctx (TextContext) – The request context containing sequence information and token indices.
  • replica_idx (int) – Index of the replica to query.
  • num_steps (int) – Number of additional steps to allocate blocks for. Defaults to 1.

Returns:

The percentage of total blocks used after allocating for the request.

Return type:

float
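
As a rough illustration of the quantity described above, the sketch below computes usage after a hypothetical allocation. It is an assumption-laden stand-in, not the library's code: the function name, its arguments, and the choice to return a fraction clamped to [0, 1] (rather than a 0 to 100 percentage) are all illustrative, since this page does not state the scale.

```python
def used_fraction_after_allocation(used_blocks, new_blocks, total_blocks):
    """Return the share of blocks in use if new_blocks more were allocated.

    Clamped to 1.0 so a request that cannot fit still yields a valid value.
    """
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return min(1.0, (used_blocks + new_blocks) / total_blocks)
```

A scheduler could compare this value against a watermark before admitting a request, which is one plausible use for a "what if" query of this kind.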

get_req_blocks()​

get_req_blocks(request_id, replica_idx)

Returns block IDs for the request on the given replica.

Parameters:

  • request_id – ID of the request.
  • replica_idx (int) – Index of the replica.

Return type:

list[int]

num_caches​

property num_caches: int

Number of KV caches managed: 1 for a single-cache model, N for a model with multiple KV caches.

num_free_blocks()​

num_free_blocks(replica_idx=0)

Returns the number of free KV cache blocks on the given replica.

Parameters:

replica_idx (int)

Return type:

int

release()​

release(request_id, replica_idx)

Releases blocks for the request on the given replica.

Parameters:

  • request_id – ID of the request to release.
  • replica_idx (int) – Index of the replica.

Return type:

None

reserve()​

reserve(replica_batches, *, num_steps=1)

Claims, allocates, and releases contexts within a scope.

This helper is for ephemeral flows (for example, warmup capture) where request IDs should be released when leaving the scope.

Parameters:

  • replica_batches (Sequence[Sequence[TextContext]]) – Per-replica lists of contexts to reserve.
  • num_steps (int) – Number of steps to allocate for each context.

Return type:

Iterator[None]

reset_metrics()​

reset_metrics()

Resets metrics for all replica managers.

Return type:

None

reset_prefix_cache()​

reset_prefix_cache()

Resets the prefix cache for all replica managers.

Return type:

None

runtime_inputs()​

runtime_inputs(batches, num_steps=1, *, max_cache_length=None, batch_characteristics=None)

Gets the graph inputs for per-replica batches of requests.

Raises a RuntimeError if any request does not already have enough blocks allocated to run for the given number of steps.

Parameters:

  • batches (Sequence[Sequence[TextContext]]) – Per-replica batches of requests.
  • num_steps (int) – Number of steps to run for.
  • max_cache_length (int | None) – Optional explicit maximum cache length used to size the lookup-table (LUT) views. If not provided, the length is derived from the requests at runtime.
  • batch_characteristics (BatchCharacteristics | None) – Optional upper-bound batch shape applied uniformly across every replica when preparing attention dispatch metadata. When provided (e.g. graph-capture replay, where every DP replica must run the identical captured graph), the dispatch key is resolved once from these aligned values; the real per-replica values must not exceed them. When None, each replica prepares metadata from its own real values (which may differ per replica).

Return type:

KVCacheInputs[Buffer, Buffer]
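
To illustrate the kind of data this step produces, the sketch below builds a padded lookup table of block IDs and a list of cache lengths for one replica's batch, and raises RuntimeError when a request lacks blocks for the requested steps. It is a simplified model, not the MAX implementation: `build_lookup_table`, `page_size`, `pad_id`, plain Python lists in place of device buffers, and the one-token-per-step assumption are all illustrative.

```python
def build_lookup_table(block_ids, seq_lens, num_steps, page_size, pad_id=0):
    """Return (lut, cache_lengths) for one replica's batch.

    block_ids[i] lists the blocks already allocated to request i, and
    seq_lens[i] is its current cached length. Assumes one token per step.
    """
    for i, (blocks, seq_len) in enumerate(zip(block_ids, seq_lens)):
        needed = -(-(seq_len + num_steps) // page_size)  # ceiling division
        if len(blocks) < needed:
            raise RuntimeError(
                f"request {i} has {len(blocks)} blocks but needs {needed}"
            )
    # Pad every row to the widest request so the table is rectangular.
    width = max((len(blocks) for blocks in block_ids), default=0)
    lut = [list(blocks) + [pad_id] * (width - len(blocks)) for blocks in block_ids]
    return lut, list(seq_lens)


lut, cache_lengths = build_lookup_table(
    block_ids=[[3, 7], [5]], seq_lens=[200, 100], num_steps=1, page_size=128
)
```

The check runs before any table is built, mirroring the documented requirement that blocks are allocated ahead of this call.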

step()​

step(batches)

Commits new tokens into the prefix cache for per-replica batches.

Parameters:

batches (Sequence[Sequence[TextContext]])

Return type:

None
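
Committing tokens to a prefix cache is commonly done by hashing each full block together with the hash of the block before it, so that a later request sharing the same prefix can reuse those blocks. The sketch below shows that general technique. It is an assumption about how such a cache can work, not a description of MAX internals: `ToyPrefixCache`, `block_hashes`, and the use of SHA-256 are hypothetical choices.

```python
import hashlib


def block_hashes(tokens, page_size):
    """Return one chained hash per full block of tokens."""
    hashes = []
    parent = ""
    full_len = len(tokens) - len(tokens) % page_size  # ignore the partial tail
    for start in range(0, full_len, page_size):
        block = tokens[start:start + page_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Maps chained block hashes to block IDs."""

    def __init__(self, page_size):
        self.page_size = page_size
        self._by_hash = {}

    def commit(self, tokens, block_ids):
        """Record every full block of a request; partial blocks are skipped."""
        for h, block_id in zip(block_hashes(tokens, self.page_size), block_ids):
            self._by_hash.setdefault(h, block_id)

    def lookup(self, tokens):
        """Return cached block IDs for the longest matching prefix."""
        hits = []
        for h in block_hashes(tokens, self.page_size):
            if h not in self._by_hash:
                break
            hits.append(self._by_hash[h])
        return hits


cache = ToyPrefixCache(page_size=2)
cache.commit([1, 2, 3, 4, 5], block_ids=[10, 11, 12])  # block 12 is partial
reused = cache.lookup([1, 2, 3, 4, 9, 9])
```

Because each hash depends on its parent, a block only matches when every block before it matches too, which keeps lookups restricted to true prefixes.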

total_num_blocks()​

total_num_blocks(replica_idx=0)

Returns the total number of KV cache blocks on the given replica.

Parameters:

replica_idx (int)

Return type:

int