Python class

PagedKVCacheManager

class max.kv_cache.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False, *, max_batch_size)

source

Bases: object

Paged KVCache manager with data and tensor parallelism support.

# Claim each request on its replica
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)

# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0, num_steps=10)
kv_manager.alloc(ctx2, replica_idx=1, num_steps=10)

# Get KVCache inputs to feed to graph (one batch per replica)
kv_cache_inputs = kv_manager.runtime_inputs(
    [[ctx1], [ctx2]], num_steps=10
)

# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)

# Commit newly written blocks to prefix cache
kv_manager.step([[ctx1], [ctx2]])

# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)

Initializes the multi-device paged KV cache manager.

Parameters:

  • params (KVCacheParamInterface) – KV cache parameters. Pass MultiKVCacheParams for models with more than one KV cache.
  • session (InferenceSession) – The MAX Engine inference session.
  • total_num_pages (int) – The total number of pages to allocate.
  • total_num_host_pages (int) – The total number of host pages to allocate. Default: 0.
  • max_batch_size (int) – Maximum runtime batch size, used to preallocate per-replica row capacity for the runtime lookup table and cache lengths. Keyword-only.
  • enable_runtime_checks (bool) – Whether to enable runtime checks. Default: False.

alloc()

alloc(data, replica_idx, num_steps=1)

source

Allocates blocks for a request to run for N steps.

When prefix caching is enabled, some of the allocated blocks may be retrieved from the prefix cache and the context’s active token window is advanced accordingly.

Parameters:

  • data (TextGenerationContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim.
  • replica_idx (int) – Index of the replica to allocate on.
  • num_steps (int) – The number of steps to reserve blocks for. Default: 1.

Raises:

Return type:

None
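The block accounting described above can be sketched in a few lines. This is a minimal illustration, not the MAX implementation: the function name `pages_needed`, the `page_size` default, and the assumption that each step appends exactly one token are all hypothetical.

```python
import math


def pages_needed(num_tokens, num_steps, pages_held=0, page_size=128, cached_tokens=0):
    """Return how many new pages a request would have to allocate.

    Hypothetical model: each step appends one token, and only whole pages
    covered by a prefix-cache hit can be shared instead of newly allocated.
    """
    # Tokens that must fit in the cache once all steps have run.
    total_tokens = num_tokens + num_steps
    total_pages = math.ceil(total_tokens / page_size)
    # A partially matched page cannot be reused, so round down.
    reused_pages = cached_tokens // page_size
    return max(0, total_pages - pages_held - reused_pages)
```

Under these assumptions, a 300-token request reserved for 10 steps needs three 128-token pages, and a 256-token prefix-cache hit reduces that to one new page.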

alloc_dummy()

alloc_dummy(request_id, replica_idx, sentinel_request_id)

source

Claims a dummy request and shares the sentinel’s block on a replica.

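Parameters:

  • request_id – The ID of the dummy request to claim.
  • replica_idx (int) – Index of the replica.
  • sentinel_request_id – The ID of the sentinel request whose block is shared.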

Return type:

None

cache_params()

cache_params(cache_idx=0)

source

Returns the KVCacheParams for a specific cache.

Parameters:

cache_idx (int)

Return type:

KVCacheParams

claim()

claim(request_id, replica_idx)

source

Reserves a sequence ID for the given request ID.

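Parameters:

  • request_id – The ID of the request to claim.
  • replica_idx (int) – Index of the replica to claim the request on.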

Return type:

None

contains()

contains(request_id, replica_idx)

source

Returns whether the request is present on the given replica.

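Parameters:

  • request_id – The ID of the request to look up.
  • replica_idx (int) – Index of the replica to check.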

Return type:

bool

dispatch_resolver()

dispatch_resolver(replica_idx=0)

source

Returns the attention dispatch resolver for a replica.

Parameters:

replica_idx (int)

Return type:

AttentionDispatchResolver

get_device_buffer()

get_device_buffer(replica_idx, cache_idx=0)

source

Returns the device buffer for a specific cache on a replica.

Parameters:

  • replica_idx (int) – Index of the replica.
  • cache_idx (int) – Index of the cache (default 0 = primary cache).

Return type:

KVCacheBuffer

get_metrics()

get_metrics(replica_idx)

source

Returns metrics for the given replica.

Parameters:

replica_idx (int)

Return type:

KVCacheMetrics

get_num_host_pages()

get_num_host_pages(replica_idx)

source

Returns the number of host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_pages()

get_num_pages(replica_idx)

source

Returns the total number of pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_host_pages()

get_num_used_host_pages(replica_idx)

source

Returns the number of used host pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_num_used_pages()

get_num_used_pages(replica_idx)

source

Returns the number of used pages for the replica.

Parameters:

replica_idx (int)

Return type:

int

get_pct_used_blocks_after_allocation()

get_pct_used_blocks_after_allocation(ctx, replica_idx, num_steps=1)

source

Gets the percentage of blocks used after allocating for a request.

Parameters:

  • ctx (TextGenerationContext) – The request context containing sequence information and token indices.
  • replica_idx (int) – Index of the replica to query.
  • num_steps (int) – Number of additional steps to allocate blocks for. Defaults to 1.

Returns:

The percentage of total blocks used after allocating for the request.

Return type:

float
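As a rough illustration of what this projection could look like, the sketch below computes the share of pages in use after a hypothetical allocation. The helper name `pct_used_after_allocation` and the choice to return a fraction capped at 1.0 are assumptions; the source does not state whether the method returns a fraction or a value out of 100.

```python
def pct_used_after_allocation(used_pages, new_pages, total_pages):
    """Return the projected share of pages in use, as a fraction in [0, 1].

    Hypothetical helper: `new_pages` is the number of pages the pending
    request would add on top of `used_pages`.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    # Cap at 1.0 so an oversubscribed request reads as "completely full".
    return min(1.0, (used_pages + new_pages) / total_pages)
```

A scheduler could compare a value like this against a watermark before admitting a request, which is one plausible use for the real method.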

get_req_blocks()

get_req_blocks(request_id, replica_idx)

source

Returns block IDs for the request on the given replica.

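Parameters:

  • request_id – The ID of the request whose block IDs are returned.
  • replica_idx (int) – Index of the replica to query.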

Return type:

list[int]

num_caches

property num_caches: int

source

Number of KV caches managed (1 for single-cache, N for multi).

release()

release(request_id, replica_idx)

source

Releases blocks for the request on the given replica.

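Parameters:

  • request_id – The ID of the request to release.
  • replica_idx (int) – Index of the replica holding the request.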

Return type:

None
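The claim, alloc, and release lifecycle can be pictured with a toy single-replica block pool. This is a sketch under stated assumptions, not how MAX stores blocks: `ToyBlockPool` and its free-list design are hypothetical, and it ignores prefix caching and host pages.

```python
class ToyBlockPool:
    """Hypothetical single-replica pool that hands out page IDs from a free list."""

    def __init__(self, total_num_pages):
        self.total_num_pages = total_num_pages
        self.free = list(range(total_num_pages))
        self.req_blocks = {}

    def claim(self, request_id):
        # Register the request before any blocks can be allocated to it.
        if request_id in self.req_blocks:
            raise ValueError(f"{request_id} is already claimed")
        self.req_blocks[request_id] = []

    def contains(self, request_id):
        return request_id in self.req_blocks

    def alloc(self, request_id, num_pages):
        if num_pages > len(self.free):
            raise RuntimeError("not enough free pages")
        for _ in range(num_pages):
            self.req_blocks[request_id].append(self.free.pop())

    def release(self, request_id):
        # Return every page held by the request to the free list.
        self.free.extend(self.req_blocks.pop(request_id))

    @property
    def num_used_pages(self):
        return self.total_num_pages - len(self.free)
```

In this toy model, releasing a request both forgets its metadata and makes its pages available to later allocations.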

reserve()

reserve(replica_batches, *, num_steps=1)

source

Claims, allocates, and releases contexts within a scope.

This helper is for ephemeral flows (for example, warmup capture) where request IDs should be released when leaving the scope.

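Parameters:

  • replica_batches – Per-replica batches of contexts to claim and allocate for.
  • num_steps (int) – The number of steps to reserve blocks for. Default: 1. Keyword-only.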

Return type:

Iterator[None]
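The scoped claim, allocate, and release pattern can be sketched with `contextlib.contextmanager`. The class `ToyReserveManager` below is a hypothetical stand-in that tracks plain request IDs rather than contexts; it shows the shape of the pattern, not the library's code.

```python
from contextlib import contextmanager


class ToyReserveManager:
    """Hypothetical stand-in that tracks claims by (request_id, replica_idx)."""

    def __init__(self):
        self.claimed = set()
        self.steps = {}

    def claim(self, request_id, replica_idx):
        self.claimed.add((request_id, replica_idx))

    def alloc(self, request_id, replica_idx, num_steps=1):
        self.steps[(request_id, replica_idx)] = num_steps

    def release(self, request_id, replica_idx):
        self.claimed.discard((request_id, replica_idx))
        self.steps.pop((request_id, replica_idx), None)

    @contextmanager
    def reserve(self, replica_batches, *, num_steps=1):
        done = []
        try:
            # The position of each batch in the outer sequence is its replica index.
            for replica_idx, batch in enumerate(replica_batches):
                for request_id in batch:
                    self.claim(request_id, replica_idx)
                    done.append((request_id, replica_idx))
                    self.alloc(request_id, replica_idx, num_steps)
            yield
        finally:
            # Release everything claimed so far, even if the body raised.
            for request_id, replica_idx in done:
                self.release(request_id, replica_idx)
```

Putting the release in a `finally` clause means the request IDs are freed even when the code inside the `with` block fails, which suits ephemeral flows such as warmup.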

reset_metrics()

reset_metrics()

source

Resets metrics for all replica managers.

Return type:

None

reset_prefix_cache()

reset_prefix_cache()

source

Resets the prefix cache for all replica managers.

Return type:

None

runtime_inputs()

runtime_inputs(batches, num_steps=1, *, max_cache_length=None)

source

Gets the graph inputs for per-replica batches of requests.

Raises a RuntimeError if any request does not already have enough blocks allocated to run for the given number of steps.

Parameters:

  • batches (Sequence[Sequence[TextGenerationContext]]) – Per-replica batches of requests.
  • num_steps (int) – Number of steps to run for. Default: 1.
  • max_cache_length (int | None) – Optional explicit max cache length to size LUT views. If not provided, uses request-derived runtime length.

Return type:

KVCacheInputs[Buffer, Buffer]
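To illustrate the kind of check and lookup table (LUT) involved, here is a minimal sketch for a single replica. Everything in it is an assumption made for illustration: the function `build_lookup_table`, the `page_size` and `pad_value` defaults, and the rule that each step adds one token. The real method returns `KVCacheInputs`, not nested lists.

```python
import math


def build_lookup_table(blocks_per_request, cache_lengths, num_steps=1,
                       page_size=128, pad_value=0):
    """Return one padded row of block IDs per request.

    Hypothetical model: a request with `length` cached tokens needs
    ceil((length + num_steps) / page_size) blocks to run `num_steps` steps.
    """
    needed = [math.ceil((length + num_steps) / page_size) for length in cache_lengths]
    for idx, (blocks, need) in enumerate(zip(blocks_per_request, needed)):
        if len(blocks) < need:
            raise RuntimeError(
                f"request {idx} has {len(blocks)} blocks but needs {need}"
            )
    # Every row is padded to the widest requirement in the batch.
    width = max(needed, default=0)
    return [
        list(blocks[:width]) + [pad_value] * (width - min(len(blocks), width))
        for blocks in blocks_per_request
    ]
```

The failure mode mirrors the documented behavior: allocation must happen before the inputs are built, so a shortfall surfaces as a `RuntimeError` rather than as a silent allocation.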

scalar_metadata_on_host()

scalar_metadata_on_host()

source

Temporarily keeps scalar dispatch metadata on CPU.

Within this context the attention dispatch resolvers return host buffers so that graph-capture replay can perform a single CPU-to-GPU inplace_copy_from instead of a redundant GPU-to-GPU copy.

Return type:

Iterator[None]
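A context manager of this kind typically flips a flag on entry and restores it on exit. The sketch below shows that pattern with a hypothetical `ToyMetadataManager`; the `_on_host` flag and the `metadata_device` method are invented for illustration and say nothing about how MAX tracks buffer placement.

```python
from contextlib import contextmanager


class ToyMetadataManager:
    """Hypothetical manager whose resolvers report where scalar metadata lives."""

    def __init__(self):
        self._on_host = False

    def metadata_device(self):
        return "cpu" if self._on_host else "gpu"

    @contextmanager
    def scalar_metadata_on_host(self):
        previous = self._on_host
        self._on_host = True
        try:
            yield
        finally:
            # Restore the earlier setting so nested or failed scopes unwind cleanly.
            self._on_host = previous
```

Saving and restoring the previous value, rather than unconditionally resetting to `False`, keeps nested uses of the context well behaved.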

step()

step(batches)

source

Commits new tokens into the prefix cache for per-replica batches.

Parameters:

batches (Sequence[Sequence[TextGenerationContext]]) – Per-replica batches of requests.

Return type:

None
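One common way to commit tokens into a prefix cache is to register each completely filled block under a hash of all tokens up to the end of that block. The sketch below shows that general technique; it is not taken from the MAX source, and the function `commit_full_blocks`, the SHA-256 chaining, and the tiny `page_size` are assumptions chosen for readability.

```python
import hashlib


def commit_full_blocks(tokens, block_ids, prefix_cache, page_size=4):
    """Register full blocks in `prefix_cache` and return how many were new.

    Hypothetical scheme: each block's key hashes its own tokens together
    with the previous block's key, so a key identifies the whole prefix.
    """
    committed = 0
    parent = ""
    for i, block_id in enumerate(block_ids):
        chunk = tokens[i * page_size:(i + 1) * page_size]
        if len(chunk) < page_size:
            # A partially filled block can still change, so it is not shared.
            break
        payload = parent + ":" + ",".join(str(t) for t in chunk)
        parent = hashlib.sha256(payload.encode()).hexdigest()
        if parent not in prefix_cache:
            prefix_cache[parent] = block_id
            committed += 1
    return committed
```

With this scheme, a second request that shares only the first full block with an earlier one reuses that entry and adds a new entry for its diverging block.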