Python module

cache_manager

`PagedKVCacheManager`

class max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False)

Paged KVCache manager with data and tensor parallelism support.

Example:

# Reserve each request on a replica
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)

# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0, num_steps=10)
kv_manager.alloc(ctx2, replica_idx=1, num_steps=10)

# Get KVCache inputs to feed to graph (one inner list per replica)
kv_cache_inputs = kv_manager.get_runtime_inputs(
    [[ctx1], [ctx2]], num_steps=10
)

# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)

# Commit newly written blocks to prefix cache
kv_manager.step([[ctx1], [ctx2]])

# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)

Parameters:

params (KVCacheParams)
session (InferenceSession)
total_num_pages (int)
total_num_host_pages (int)
enable_runtime_checks (bool)

`alloc()`

alloc(data, replica_idx, num_steps=1)

Allocates blocks for a request to run for N steps.

N is given by num_steps. When prefix caching is enabled, some of the allocated blocks may be retrieved from the prefix cache.

Parameters:

data (TextGenerationContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim().
replica_idx (int) – The index of the replica the request was claimed on.
num_steps (int) – The number of steps to reserve blocks for. Default: 1.

Raises:

InsufficientBlocksError – If there are insufficient free blocks to satisfy the allocation.

Return type:

None
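Because alloc() raises InsufficientBlocksError when free blocks run out, a caller may want to defer requests that do not fit instead of failing the whole batch. The following is a minimal sketch under stated assumptions: `StubKVManager` and the locally defined `InsufficientBlocksError` are hypothetical stand-ins (the real exception's import path is not given in this reference), and `alloc_or_defer` is an illustrative helper, not part of the library.

```python
class InsufficientBlocksError(Exception):
    """Local stand-in; in practice, use the exception the library raises."""


class StubKVManager:
    """Hypothetical stand-in exposing only alloc(), backed by a free-block counter."""

    def __init__(self, free_blocks: int, blocks_per_request: int) -> None:
        self.free_blocks = free_blocks
        self.blocks_per_request = blocks_per_request

    def alloc(self, data, replica_idx: int, num_steps: int = 1) -> None:
        # Toy accounting: every request costs a fixed number of blocks.
        if self.blocks_per_request > self.free_blocks:
            raise InsufficientBlocksError("not enough free blocks")
        self.free_blocks -= self.blocks_per_request


def alloc_or_defer(kv_manager, pending, replica_idx, num_steps=1):
    """Split pending requests into (scheduled, deferred) based on alloc() success."""
    scheduled, deferred = [], []
    for ctx in pending:
        try:
            kv_manager.alloc(ctx, replica_idx=replica_idx, num_steps=num_steps)
        except InsufficientBlocksError:
            deferred.append(ctx)  # retry on a later scheduling pass
        else:
            scheduled.append(ctx)
    return scheduled, deferred


manager = StubKVManager(free_blocks=5, blocks_per_request=2)
scheduled, deferred = alloc_or_defer(
    manager, ["req-a", "req-b", "req-c"], replica_idx=0
)
print(scheduled, deferred)
```

With five free blocks and two blocks per request, the first two requests are scheduled and the third is deferred.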

`claim()`

claim(request_id, replica_idx)

Reserve a sequence ID for the given request ID.

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

None

`contains()`

contains(request_id, replica_idx)

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

bool
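This reference gives no description for contains(). Assuming it reports whether the request ID is currently claimed on the given replica, it could guard against claiming the same request twice. The sketch below relies on that assumption; `StubKVManager` and `ensure_claimed` are hypothetical names used only for illustration.

```python
class StubKVManager:
    """Hypothetical stand-in that tracks claimed (request_id, replica_idx) pairs."""

    def __init__(self) -> None:
        self._claimed = set()

    def contains(self, request_id, replica_idx) -> bool:
        return (request_id, replica_idx) in self._claimed

    def claim(self, request_id, replica_idx) -> None:
        self._claimed.add((request_id, replica_idx))


def ensure_claimed(kv_manager, request_id, replica_idx) -> bool:
    """Claim the request if needed; return True when a new claim was made."""
    if kv_manager.contains(request_id, replica_idx):
        return False
    kv_manager.claim(request_id, replica_idx)
    return True


manager = StubKVManager()
first = ensure_claimed(manager, "req-a", replica_idx=0)
second = ensure_claimed(manager, "req-a", replica_idx=0)
print(first, second)
```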

`get_device_tensors()`

get_device_tensors(replica_idx)

Parameters:

replica_idx (int)

Return type:

list[Buffer]

`get_metrics()`

get_metrics(replica_idx)

Parameters:

replica_idx (int)

Return type:

KVCacheMetrics

`get_num_host_pages()`

get_num_host_pages(replica_idx)

Parameters:

replica_idx (int)

Return type:

int

`get_num_pages()`

get_num_pages(replica_idx)

Parameters:

replica_idx (int)

Return type:

int

`get_num_used_host_pages()`

get_num_used_host_pages(replica_idx)

Parameters:

replica_idx (int)

Return type:

int

`get_num_used_pages()`

get_num_used_pages(replica_idx)

Parameters:

replica_idx (int)

Return type:

int

`get_pct_used_blocks_after_allocation()`

get_pct_used_blocks_after_allocation(ctx, replica_idx, num_steps=1)

Get the percentage of blocks used after allocating for a request.

Parameters:

ctx (TextGenerationContext) – The request context containing sequence information and token indices.
replica_idx (int) – The index of the replica whose block usage is evaluated.
num_steps (int) – The number of additional steps to allocate blocks for. Default: 1.

Returns:

The percentage of total blocks used after allocating for the request.

Return type:

float
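A projected usage figure like this lends itself to an admission check before calling alloc(). The sketch below calls the method with the signature documented above, but `StubKVManager`, `should_admit`, and the `watermark` parameter are hypothetical. The reference does not state whether the returned value is on a 0.0 to 1.0 or a 0 to 100 scale, so the example simply assumes the watermark uses the same scale as the return value.

```python
class StubKVManager:
    """Hypothetical stand-in returning a fixed post-allocation usage value."""

    def __init__(self, usage_after_alloc: float) -> None:
        self.usage_after_alloc = usage_after_alloc

    def get_pct_used_blocks_after_allocation(
        self, ctx, replica_idx, num_steps=1
    ) -> float:
        return self.usage_after_alloc


def should_admit(kv_manager, ctx, replica_idx, num_steps=1, watermark=0.95) -> bool:
    """Admit only if projected usage stays at or below the watermark.

    Assumes the returned value and the watermark share a scale (here 0.0 to 1.0).
    """
    projected = kv_manager.get_pct_used_blocks_after_allocation(
        ctx, replica_idx, num_steps
    )
    return projected <= watermark


admit_light = should_admit(StubKVManager(0.40), "req-a", replica_idx=0)
admit_heavy = should_admit(StubKVManager(0.99), "req-b", replica_idx=0)
print(admit_light, admit_heavy)
```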

`get_req_blocks()`

get_req_blocks(request_id, replica_idx)

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

list[int]

`get_runtime_inputs()`

get_runtime_inputs(batches, num_steps=1)

Get the graph inputs for per-replica batches of requests.

Raises a RuntimeError if any request does not already have enough blocks allocated to run for the given number of steps.

Parameters:

batches (Sequence[Sequence[TextGenerationContext]]) – Per-replica batches of requests.
num_steps (int) – The number of steps to run for. Default: 1.

Return type:

list[RaggedKVCacheInputs]
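The batches argument is a sequence of per-replica batches. Assuming the outer sequence is indexed by replica index (an interpretation of "per-replica", not something this reference spells out), the nested structure could be built from request-to-replica assignments as sketched below. `group_by_replica` is a hypothetical helper, and whether the manager accepts empty inner batches is not stated here.

```python
def group_by_replica(assignments, num_replicas):
    """Build per-replica batches from (context, replica_idx) pairs.

    Returns a list with one inner list per replica, ordered by replica index.
    Replicas with no requests get an empty inner list.
    """
    batches = [[] for _ in range(num_replicas)]
    for ctx, replica_idx in assignments:
        if not 0 <= replica_idx < num_replicas:
            raise ValueError(f"replica_idx {replica_idx} out of range")
        batches[replica_idx].append(ctx)
    return batches


batches = group_by_replica(
    [("ctx1", 0), ("ctx2", 1), ("ctx3", 0)], num_replicas=2
)
print(batches)  # [['ctx1', 'ctx3'], ['ctx2']]
```

The resulting nested list has the shape that get_runtime_inputs() and step() take as their batches argument.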

`increment_cache_lengths()`

increment_cache_lengths(kv_cache_inputs, prev_model_inputs)

Parameters:

kv_cache_inputs (Sequence[RaggedKVCacheInputs])
prev_model_inputs (Any)

Return type:

Sequence[RaggedKVCacheInputs]

`infer_optimal_batch_size()`

classmethod infer_optimal_batch_size(params, max_seq_len, available_cache_memory, devices, **kwargs)

Parameters:

params (KVCacheParamInterface)
max_seq_len (int)
available_cache_memory (int)
devices (Sequence[Device])
kwargs (Any)

Return type:

int

`release()`

release(request_id, replica_idx)

Parameters:

request_id (RequestID)
replica_idx (int)

Return type:

None
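The class-level example shows release() freeing a request's metadata and KV blocks. To avoid leaking blocks when model execution fails midway, a caller might pair claim() with release() in a try/finally. This is a sketch only: `RecordingKVManager`, `run_with_cleanup`, and `failing_work` are hypothetical, and it assumes that releasing a claimed request is valid even when alloc() or the model run did not complete.

```python
class RecordingKVManager:
    """Hypothetical stand-in that records the order of calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def claim(self, request_id, replica_idx):
        self.calls.append("claim")

    def alloc(self, data, replica_idx, num_steps=1):
        self.calls.append("alloc")

    def release(self, request_id, replica_idx):
        self.calls.append("release")


def run_with_cleanup(kv_manager, ctx, request_id, replica_idx, work):
    """Claim and allocate, run `work`, and release even if `work` raises."""
    kv_manager.claim(request_id, replica_idx=replica_idx)
    try:
        kv_manager.alloc(ctx, replica_idx=replica_idx, num_steps=1)
        return work(ctx)
    finally:
        # Runs on success and on failure, so the claim is always released.
        kv_manager.release(request_id, replica_idx=replica_idx)


def failing_work(ctx):
    raise RuntimeError("model execution failed")


manager = RecordingKVManager()
try:
    run_with_cleanup(manager, "ctx", "req-a", 0, failing_work)
except RuntimeError:
    pass
print(manager.calls)  # ['claim', 'alloc', 'release']
```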

`reset_metrics()`

reset_metrics()

Return type:

None

`reset_prefix_cache()`

reset_prefix_cache()

Return type:

None

`step()`

step(batches)

Commit new tokens into the prefix cache for per-replica batches.

Parameters:

batches (Sequence[Sequence[TextGenerationContext]])

Return type:

None
