Python module
cache_manager
PagedKVCacheManager
class max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager(params, session, total_num_pages, total_num_host_pages=0, enable_runtime_checks=False)
Paged KVCache manager with data and tensor parallelism support.
# Reserve a sequence ID for each request on its replica
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)
# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0, num_steps=10)
kv_manager.alloc(ctx2, replica_idx=1, num_steps=10)
# Get KVCache inputs to feed to graph
kv_cache_inputs = kv_manager.get_runtime_inputs(
[[ctx1, ctx2]], num_steps=10
)
# Run model...
# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)
# Commit newly written blocks to prefix cache
kv_manager.step([[ctx1, ctx2]])
# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)
Parameters:
- params (KVCacheParams)
- session (InferenceSession)
- total_num_pages (int)
- total_num_host_pages (int)
- enable_runtime_checks (bool)
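The example above pairs each claim with a release. A minimal sketch of one way to keep that pairing safe when an error occurs follows. `claimed_request` is a hypothetical helper, not part of the library, and `RecordingManager` is a stand-in that only records calls, so the snippet runs without MAX installed. Whether release must follow a failed alloc in the real manager is an assumption here.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def claimed_request(kv_manager, ctx, replica_idx, num_steps=1):
    """Claim and allocate on entry; always release on exit (hypothetical helper)."""
    kv_manager.claim(ctx.request_id, replica_idx=replica_idx)
    try:
        kv_manager.alloc(ctx, replica_idx=replica_idx, num_steps=num_steps)
        yield ctx
    finally:
        # Runs even if alloc() or the body raises, so the claim is not leaked.
        kv_manager.release(ctx.request_id, replica_idx=replica_idx)


class RecordingManager:
    """Stand-in that only records calls; not the real PagedKVCacheManager."""

    def __init__(self):
        self.calls = []

    def claim(self, request_id, replica_idx):
        self.calls.append(("claim", request_id, replica_idx))

    def alloc(self, data, replica_idx, num_steps=1):
        self.calls.append(("alloc", data.request_id, replica_idx, num_steps))

    def release(self, request_id, replica_idx):
        self.calls.append(("release", request_id, replica_idx))


manager = RecordingManager()
ctx = SimpleNamespace(request_id="req-1")  # minimal fake context
with claimed_request(manager, ctx, replica_idx=0, num_steps=10):
    pass  # get_runtime_inputs(), model execution, and step() would go here
```

With a real manager, the same helper would be called with the actual `kv_manager` and a `TextGenerationContext`.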
alloc()
alloc(data, replica_idx, num_steps=1)
Allocates the blocks a request needs to run for num_steps steps.
When prefix caching is enabled, some of these blocks may be retrieved from the prefix cache instead of being newly allocated.
Parameters:
- data (TextGenerationContext) – The text generation context for the request. The request ID must already be assigned to a replica via claim.
- num_steps (int) – The number of steps to reserve blocks for. Default: 1.
- replica_idx (int)
Raises:
- InsufficientBlocksError – If there are insufficient free blocks to satisfy the allocation.
Return type:
None
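To make the allocation failure mode concrete, here is a toy model of paged allocation. It is a sketch under stated assumptions, not the library's algorithm: the page size, the "one cache slot per token, one new token per step" formula, and the `toy_alloc` function are all hypothetical, and `InsufficientBlocksError` is redefined locally as a stand-in for the library's exception of the same name.

```python
import math


class InsufficientBlocksError(Exception):
    """Local stand-in for the library exception of the same name."""


def pages_needed(num_tokens: int, num_steps: int, page_size: int) -> int:
    # Assumption: every token occupies one slot and each step appends one token.
    return math.ceil((num_tokens + num_steps) / page_size)


def toy_alloc(free_pages, held_pages, num_tokens, num_steps, page_size):
    """Return (free_pages, held_pages) after allocating, or raise if short."""
    extra = max(0, pages_needed(num_tokens, num_steps, page_size) - held_pages)
    if extra > free_pages:
        raise InsufficientBlocksError(
            f"need {extra} more pages, only {free_pages} free"
        )
    return free_pages - extra, held_pages + extra


# A 100-token request reserving 10 steps with 16-slot pages needs 7 pages.
free_after, held_after = toy_alloc(
    free_pages=10, held_pages=2, num_tokens=100, num_steps=10, page_size=16
)
```

A caller that catches the exception would typically defer the request until other requests release their blocks.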
claim()
claim(request_id, replica_idx)
Reserve a sequence ID for the given request ID.
contains()
contains(request_id, replica_idx)
get_device_tensors()
get_device_tensors(replica_idx)
get_metrics()
get_metrics(replica_idx)
Parameters:
- replica_idx (int)
Return type:
KVCacheMetrics
get_num_host_pages()
get_num_host_pages(replica_idx)
get_num_pages()
get_num_pages(replica_idx)
get_num_used_host_pages()
get_num_used_host_pages(replica_idx)
get_num_used_pages()
get_num_used_pages(replica_idx)
get_pct_used_blocks_after_allocation()
get_pct_used_blocks_after_allocation(ctx, replica_idx, num_steps=1)
Get the percentage of blocks used after allocating for a request.
Parameters:
- ctx (TextGenerationContext) – The request context containing sequence information and token indices.
- num_steps (int) – Number of additional steps to allocate blocks for. Defaults to 1.
- replica_idx (int)
Returns:
The percentage of total blocks used after allocating for the request.
Return type:
get_req_blocks()
get_req_blocks(request_id, replica_idx)
get_runtime_inputs()
get_runtime_inputs(batches, num_steps=1)
Get the graph inputs for per-replica batches of requests.
Raises a RuntimeError if any request does not already have enough blocks allocated to run for the given number of steps.
Parameters:
- batches (Sequence[Sequence[TextGenerationContext]]) – Per-replica batches of requests
- num_steps (int) – Number of steps to run for
Return type:
list[RaggedKVCacheInputs]
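The batches argument is a nested sequence. A small sketch of building it from (context, replica index) pairs follows. `batches_by_replica` is a hypothetical helper, and it assumes the outer sequence is indexed by replica, which the "per-replica batches" wording suggests but this page does not state outright.

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def batches_by_replica(
    assignments: Iterable[tuple[T, int]], num_replicas: int
) -> list[list[T]]:
    """Group contexts into one inner list per replica, preserving order."""
    batches: list[list[T]] = [[] for _ in range(num_replicas)]
    for ctx, replica_idx in assignments:
        if not 0 <= replica_idx < num_replicas:
            raise ValueError(f"replica_idx {replica_idx} out of range")
        batches[replica_idx].append(ctx)
    return batches


# Strings stand in for TextGenerationContext objects in this demo.
demo_batches = batches_by_replica([("a", 0), ("b", 1), ("c", 0)], num_replicas=2)
```

The result could then be passed as `batches` to both get_runtime_inputs and step, so the two calls see the same grouping.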
increment_cache_lengths()
increment_cache_lengths(kv_cache_inputs, prev_model_inputs)
infer_optimal_batch_size()
classmethod infer_optimal_batch_size(params, max_seq_len, available_cache_memory, devices, **kwargs)
release()
release(request_id, replica_idx)
reset_metrics()
reset_metrics()
Return type:
None
reset_prefix_cache()
reset_prefix_cache()
Return type:
None
step()
step(batches)
Commit new tokens into the prefix cache for per-replica batches.
Parameters:
- batches (Sequence[Sequence[TextGenerationContext]])
Return type:
None
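The class-level example calls step only after the contexts have been updated with new tokens. A minimal sketch of that ordering for a single iteration follows. `run_iteration` and `run_model` are hypothetical names, `run_model` is assumed to return one token per context in the same nesting as `batches`, and the fake classes exist only so the snippet runs without MAX.

```python
def run_iteration(kv_manager, batches, run_model, num_steps=1):
    """One pass: fetch KV inputs, run the model, update contexts, commit."""
    kv_cache_inputs = kv_manager.get_runtime_inputs(batches, num_steps=num_steps)
    new_tokens = run_model(kv_cache_inputs)  # hypothetical model callable
    for batch, tokens in zip(batches, new_tokens):
        for ctx, token in zip(batch, tokens):
            ctx.update(token)
    # Commit only after every context has its new token.
    kv_manager.step(batches)


class FakeContext:
    """Stand-in for TextGenerationContext; records tokens passed to update()."""

    def __init__(self):
        self.tokens = []

    def update(self, token):
        self.tokens.append(token)


class FakeManager:
    """Stand-in that records call order; not the real PagedKVCacheManager."""

    def __init__(self):
        self.events = []

    def get_runtime_inputs(self, batches, num_steps=1):
        self.events.append("get_runtime_inputs")
        return [None for _ in batches]

    def step(self, batches):
        self.events.append("step")


ctx1, ctx2 = FakeContext(), FakeContext()
fake_manager = FakeManager()
run_iteration(fake_manager, [[ctx1, ctx2]], run_model=lambda inputs: [[42, 42]])
```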