# `manager`

Abstract base class for KV cache managers.
## `KVCacheInputSymbols` \{#max.pipelines.kv_cache.manager.KVCacheInputSymbols}

> *class* max.pipelines.kv_cache.manager.KVCacheInputSymbols
Base class for input symbols for KV cache managers.
Each derived class is responsible for defining the input symbols for its specific KV cache manager.

For example, here's a derived class for a text KV cache manager:

```python
@dataclass
class ContinuousBatchingKVCacheInputSymbols(KVCacheInputSymbols):
    kv_blocks: TensorType
    cache_lengths: TensorType
    lookup_table: TensorType
    max_lengths: TensorType
```
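The derived-class pattern above can be made into a standalone, runnable sketch. `TensorType` below is a local stand-in so the snippet runs without the MAX library installed; the point is that because the symbols are dataclass fields, a manager can enumerate them generically:

```python
from dataclasses import dataclass, fields

class TensorType:
    """Local stand-in for the MAX TensorType, so this sketch runs standalone."""

@dataclass
class KVCacheInputSymbols:
    """Base class: subclasses declare their cache inputs as dataclass fields."""

@dataclass
class ContinuousBatchingKVCacheInputSymbols(KVCacheInputSymbols):
    kv_blocks: TensorType
    cache_lengths: TensorType
    lookup_table: TensorType
    max_lengths: TensorType

# Enumerate the declared input symbols generically, e.g. to build a
# graph's input signature:
names = [f.name for f in fields(ContinuousBatchingKVCacheInputSymbols)]
print(names)  # ['kv_blocks', 'cache_lengths', 'lookup_table', 'max_lengths']
```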
## `KVCacheManager` \{#max.pipelines.kv_cache.manager.KVCacheManager}
> *class* max.pipelines.kv_cache.manager.KVCacheManager(params: [KVCacheParams](cache_params.md#max.pipelines.kv_cache.cache_params.KVCacheParams), max_batch_size: [int](https://docs.python.org/3/library/functions.html#int), max_seq_len: [int](https://docs.python.org/3/library/functions.html#int), num_layers: [int](https://docs.python.org/3/library/functions.html#int), devices: [List](https://docs.python.org/3/library/typing.html#typing.List)[[Device](../../driver.md#max.driver.Device)], session: [InferenceSession](../../engine.md#max.engine.InferenceSession), is_ragged: [bool](https://docs.python.org/3/library/functions.html#bool) = False)
### `claim()` \{#max.pipelines.kv_cache.manager.KVCacheManager.claim}
> claim(n: [int](https://docs.python.org/3/library/functions.html#int)) → [List](https://docs.python.org/3/library/typing.html#typing.List)[[int](https://docs.python.org/3/library/functions.html#int)]
Claims `n` blocks of memory in the cache for incoming requests.

This returns a list of sequence IDs, each identifying a sequence's location within the cache. A sequence ID can then be passed to the `fetch` function to return the `ContinuousBatchingKVCacheCollection` for those sequences.
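The claim/release lifecycle can be illustrated with a toy model, where a plain Python set stands in for real cache memory. This is a sketch of the documented behavior, not the actual `KVCacheManager` implementation:

```python
# Toy model of the claim()/contains()/release() lifecycle: sequence ids
# move between a free pool and an active set.
class SlotPool:
    def __init__(self, max_batch_size: int) -> None:
        self.slots_remaining = set(range(max_batch_size))  # free sequence ids
        self.active: set[int] = set()                      # claimed ids

    def claim(self, n: int) -> list[int]:
        """Claim n sequence ids for incoming requests."""
        if n > len(self.slots_remaining):
            raise RuntimeError("not enough free cache slots")
        ids = [self.slots_remaining.pop() for _ in range(n)]
        self.active.update(ids)
        return ids

    def contains(self, seq_id: int) -> bool:
        return seq_id in self.active

    def release(self, seq_id: int) -> None:
        """Mark the sequence complete and return its slot to the pool."""
        self.active.discard(seq_id)
        self.slots_remaining.add(seq_id)

pool = SlotPool(max_batch_size=4)
ids = pool.claim(2)
print(pool.contains(ids[0]), len(pool.slots_remaining))  # True 2
pool.release(ids[0])
print(pool.contains(ids[0]), len(pool.slots_remaining))  # False 3
```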
### `contains()` \{#max.pipelines.kv_cache.manager.KVCacheManager.contains}
> contains(seq_id: [int](https://docs.python.org/3/library/functions.html#int)) → [bool](https://docs.python.org/3/library/functions.html#bool)
Returns whether the given `seq_id` is currently active in the cache.
### `estimated_memory_size()` \{#max.pipelines.kv_cache.manager.KVCacheManager.estimated_memory_size}
> *abstract classmethod* estimated_memory_size(params: [KVCacheParams](cache_params.md#max.pipelines.kv_cache.cache_params.KVCacheParams), max_batch_size: [int](https://docs.python.org/3/library/functions.html#int), max_seq_len: [int](https://docs.python.org/3/library/functions.html#int), num_layers: [int](https://docs.python.org/3/library/functions.html#int), available_cache_memory: [int](https://docs.python.org/3/library/functions.html#int), devices: [List](https://docs.python.org/3/library/typing.html#typing.List)[[Device](../../driver.md#max.driver.Device)], \*\*kwargs: [Any](https://docs.python.org/3/library/typing.html#typing.Any)) → [int](https://docs.python.org/3/library/functions.html#int)
Returns the estimated total memory usage of the KV cache.
### `external_claim()` \{#max.pipelines.kv_cache.manager.KVCacheManager.external_claim}
> external_claim(seq_ids: [List](https://docs.python.org/3/library/typing.html#typing.List)[[int](https://docs.python.org/3/library/functions.html#int)]) → [None](https://docs.python.org/3/library/constants.html#None)
Variant of `claim` in which the sequence IDs are reserved externally.
### `fetch()` \{#max.pipelines.kv_cache.manager.KVCacheManager.fetch}
> *final* fetch(seq_ids_and_prompts: [dict](https://docs.python.org/3/library/stdtypes.html#dict)[[int](https://docs.python.org/3/library/functions.html#int), [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)], num_steps: [int](https://docs.python.org/3/library/functions.html#int) = 1) → [Sequence](https://docs.python.org/3/library/typing.html#typing.Sequence)[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)[[max.driver.tensor.Tensor](../../driver.md#max.driver.Tensor), ...]]
Returns the blocks and other inputs to the KV cache kernel for the given sequence IDs and prompts.
### `increment_cache_lengths()` \{#max.pipelines.kv_cache.manager.KVCacheManager.increment_cache_lengths}
> increment_cache_lengths(kv_cache_inputs: [Sequence](https://docs.python.org/3/library/typing.html#typing.Sequence)[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)[[max.driver.tensor.Tensor](../../driver.md#max.driver.Tensor), ...]], prev_model_inputs: [Any](https://docs.python.org/3/library/typing.html#typing.Any)) → [List](https://docs.python.org/3/library/typing.html#typing.List)[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)[[max.driver.tensor.Tensor](../../driver.md#max.driver.Tensor), ...]]
Prepares the inputs for a multistep execution, generally by incrementing the cache lengths. This should not require a device synchronization, as that would defeat the purpose of multistep execution.

This should also not update the cache lengths in the manager, because the batch is still considered in progress.
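The bookkeeping described above can be sketched in a few lines. A NumPy array stands in for a device tensor here; the key property is that the per-step inputs are advanced as a copy, while the manager's committed lengths stay untouched until `step` runs:

```python
import numpy as np

# Sketch of multistep bookkeeping: bump a *copy* of the cache-length
# inputs between steps, leaving the manager's authoritative state
# unchanged until the batch is committed.
def increment_cache_lengths(cache_lengths: np.ndarray,
                            tokens_per_step: int = 1) -> np.ndarray:
    # Returns a new array; no in-place mutation, no synchronization.
    return cache_lengths + tokens_per_step

manager_state = np.array([5, 9, 2])   # committed lengths per sequence
step_inputs = manager_state.copy()
for _ in range(3):                    # three speculative decode steps
    step_inputs = increment_cache_lengths(step_inputs)
print(step_inputs.tolist())    # [8, 12, 5]
print(manager_state.tolist())  # [5, 9, 2] -- unchanged until step()
```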
### `infer_optimal_batch_size()` \{#max.pipelines.kv_cache.manager.KVCacheManager.infer_optimal_batch_size}
> *abstract classmethod* infer_optimal_batch_size(params: [KVCacheParams](cache_params.md#max.pipelines.kv_cache.cache_params.KVCacheParams), max_seq_len: [int](https://docs.python.org/3/library/functions.html#int), num_layers: [int](https://docs.python.org/3/library/functions.html#int), available_cache_memory: [int](https://docs.python.org/3/library/functions.html#int), devices: [List](https://docs.python.org/3/library/typing.html#typing.List)[[Device](../../driver.md#max.driver.Device)], \*\*kwargs: [Any](https://docs.python.org/3/library/typing.html#typing.Any)) → [int](https://docs.python.org/3/library/functions.html#int)
Returns the estimated optimal batch size for the KV cache.
### `input_symbols()` \{#max.pipelines.kv_cache.manager.KVCacheManager.input_symbols}
> *abstract* input_symbols() → [Sequence](https://docs.python.org/3/library/typing.html#typing.Sequence)[[KVCacheInputSymbols](#max.pipelines.kv_cache.manager.KVCacheInputSymbols)]
Returns the input symbols for the KV cache manager.
### `max_sequence_length` \{#max.pipelines.kv_cache.manager.KVCacheManager.max_sequence_length}
> *property* max_sequence_length*: [int](https://docs.python.org/3/library/functions.html#int)*
The maximum sequence length in the current cache.
### `num_kv_inputs()` \{#max.pipelines.kv_cache.manager.KVCacheManager.num_kv_inputs}
> num_kv_inputs() → [int](https://docs.python.org/3/library/functions.html#int)
Returns the default number of KV cache inputs for KV managers.
Subclasses with a different number of KV cache inputs should override both this method and `increment_cache_lengths`.
### `release()` \{#max.pipelines.kv_cache.manager.KVCacheManager.release}
> release(seq_id: [int](https://docs.python.org/3/library/functions.html#int)) → [None](https://docs.python.org/3/library/constants.html#None)
Releases the given `seq_id`, marking the sequence as complete.

This returns the sequence ID to the available pool of cache memory, allowing it to be reused when a new sequence is claimed.
### `slots_remaining` \{#max.pipelines.kv_cache.manager.KVCacheManager.slots_remaining}
> *property* slots_remaining*: [set](https://docs.python.org/3/library/stdtypes.html#set)[[int](https://docs.python.org/3/library/functions.html#int)]*
The set of cache slots still available.
### `step()` \{#max.pipelines.kv_cache.manager.KVCacheManager.step}
> step(seq_ids_and_new_tokens: [dict](https://docs.python.org/3/library/stdtypes.html#dict)[[int](https://docs.python.org/3/library/functions.html#int), [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)]) → [None](https://docs.python.org/3/library/constants.html#None)
Updates the `cache_lengths` objects to record that a new KV projection step has occurred and that the underlying memory has been written to. The `cache_lengths` values are then used downstream in `fetch` to track which section of memory the kernels should use.
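The update that `step` performs can be sketched with toy state: after a forward pass has written new KV projections, the manager records how many tokens each sequence appended, matching the documented `dict[int, numpy.ndarray]` signature. This is illustrative only, not the real manager:

```python
import numpy as np

# Toy cache-length bookkeeping: seq_id -> number of committed tokens.
cache_lengths: dict[int, int] = {0: 5, 1: 9}

def step(seq_ids_and_new_tokens: dict[int, np.ndarray]) -> None:
    # Commit each sequence's newly written tokens so a later fetch()
    # knows which region of cache memory is valid.
    for seq_id, new_tokens in seq_ids_and_new_tokens.items():
        cache_lengths[seq_id] += len(new_tokens)

step({0: np.array([101, 102]), 1: np.array([7])})
print(cache_lengths)  # {0: 7, 1: 10}
```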