
# `manager`

Python module providing the abstract base class `KVCacheManager` for KV cache management.

## `KVCacheInputSymbols` \{#max.pipelines.kv_cache.manager.KVCacheInputSymbols}

> *class* max.pipelines.kv_cache.manager.KVCacheInputSymbols

Base class for input symbols for KV cache managers.

The derived class is responsible for defining the input symbols for the specific KV cache manager.

For example, here’s a derived class for a text KV cache manager:

```python
@dataclass
class ContinuousBatchingKVCacheInputSymbols(KVCacheInputSymbols):
    kv_blocks: TensorType
    cache_lengths: TensorType
    lookup_table: TensorType
    max_lengths: TensorType
```
## `KVCacheManager` \{#max.pipelines.kv_cache.manager.KVCacheManager}

> *class* max.pipelines.kv_cache.manager.KVCacheManager(params: [KVCacheParams](cache_params.md#max.pipelines.kv_cache.cache_params.KVCacheParams), max_batch_size: [int](https://docs.python.org/3/library/functions.html#int), max_seq_len: [int](https://docs.python.org/3/library/functions.html#int), num_layers: [int](https://docs.python.org/3/library/functions.html#int), devices: [List](https://docs.python.org/3/library/typing.html#typing.List)[[Device](../../driver.md#max.driver.Device)], session: [InferenceSession](../../engine.md#max.engine.InferenceSession), is_ragged: [bool](https://docs.python.org/3/library/functions.html#bool) = False)

### `claim()` \{#max.pipelines.kv_cache.manager.KVCacheManager.claim}

> claim(n: [int](https://docs.python.org/3/library/functions.html#int)) → [List](https://docs.python.org/3/library/typing.html#typing.List)[[int](https://docs.python.org/3/library/functions.html#int)]

Claims `n` blocks of memory in the cache for incoming requests.

This returns a list of sequence ids, which identify a sequence’s
location within the cache. Each sequence id can then be passed
to the `fetch()` function to return the `ContinuousBatchingKVCacheCollection`
for those sequences.
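The claim/release lifecycle described above can be sketched with a toy slot pool. This is purely illustrative, assuming a fixed pool of `max_batch_size` slots; `ToySlotPool` is hypothetical and not part of the real API:

```python
class ToySlotPool:
    """Minimal stand-in for the sequence-id bookkeeping a KV cache manager does."""

    def __init__(self, max_batch_size: int):
        # All slots start out available.
        self.available = set(range(max_batch_size))

    def claim(self, n: int) -> list[int]:
        # Claim n sequence ids for incoming requests.
        return [self.available.pop() for _ in range(n)]

    def contains(self, seq_id: int) -> bool:
        # A slot is "claimed" when it is no longer in the available pool.
        return seq_id not in self.available

    def release(self, seq_id: int) -> None:
        # Return the slot to the pool so a new sequence can reuse it.
        self.available.add(seq_id)


pool = ToySlotPool(max_batch_size=4)
seq_ids = pool.claim(2)               # two ids now identify cache locations
assert all(pool.contains(s) for s in seq_ids)
pool.release(seq_ids[0])              # sequence complete; slot is reusable
assert not pool.contains(seq_ids[0])
```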

### `contains()` \{#max.pipelines.kv_cache.manager.KVCacheManager.contains}

> contains(seq_id: [int](https://docs.python.org/3/library/functions.html#int)) → [bool](https://docs.python.org/3/library/functions.html#bool)

Returns whether the given `seq_id` is currently claimed in the cache.

### `estimated_memory_size()` \{#max.pipelines.kv_cache.manager.KVCacheManager.estimated_memory_size}

> *abstract classmethod* estimated_memory_size(params: [KVCacheParams](cache_params.md#max.pipelines.kv_cache.cache_params.KVCacheParams), max_batch_size: [int](https://docs.python.org/3/library/functions.html#int), max_seq_len: [int](https://docs.python.org/3/library/functions.html#int), num_layers: [int](https://docs.python.org/3/library/functions.html#int), available_cache_memory: [int](https://docs.python.org/3/library/functions.html#int), devices: [List](https://docs.python.org/3/library/typing.html#typing.List)[[Device](../../driver.md#max.driver.Device)], \*\*kwargs: [Any](https://docs.python.org/3/library/typing.html#typing.Any)) → [int](https://docs.python.org/3/library/functions.html#int)

Returns the estimated total memory usage of the KV cache.
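As a rough illustration of the kind of arithmetic such an estimate involves: a dense transformer KV cache typically scales with batch size, sequence length, layer count, KV head count, head dimension, dtype width, and a factor of 2 for the separate key and value tensors. The function and parameter names below are illustrative, not the actual `KVCacheParams` fields:

```python
def toy_estimated_memory_size(
    max_batch_size: int,
    max_seq_len: int,
    num_layers: int,
    n_kv_heads: int,
    head_dim: int,
    dtype_bytes: int,
) -> int:
    # 2x accounts for keys and values being cached separately.
    return (
        2 * max_batch_size * max_seq_len * num_layers
        * n_kv_heads * head_dim * dtype_bytes
    )


# E.g. a Llama-like config: 32 layers, 8 KV heads of dim 128, bfloat16,
# 16 sequences of up to 4096 tokens -> exactly 8 GiB.
size = toy_estimated_memory_size(16, 4096, 32, 8, 128, 2)
assert size == 8 * 2**30
```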

### `external_claim()` \{#max.pipelines.kv_cache.manager.KVCacheManager.external_claim}

> external_claim(seq_ids: [List](https://docs.python.org/3/library/typing.html#typing.List)[[int](https://docs.python.org/3/library/functions.html#int)]) → [None](https://docs.python.org/3/library/constants.html#None)

Variant of `claim()` in which the sequence ids are reserved externally.

### `fetch()` \{#max.pipelines.kv_cache.manager.KVCacheManager.fetch}

> *final* fetch(seq_ids_and_prompts: [dict](https://docs.python.org/3/library/stdtypes.html#dict)[[int](https://docs.python.org/3/library/functions.html#int), [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)], num_steps: [int](https://docs.python.org/3/library/functions.html#int) = 1) → [Sequence](https://docs.python.org/3/library/typing.html#typing.Sequence)[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)[[max.driver.tensor.Tensor](../../driver.md#max.driver.Tensor), ...]]

Returns the blocks and other inputs to the KV cache kernel for the given
sequence ids and prompts.

### `increment_cache_lengths()` \{#max.pipelines.kv_cache.manager.KVCacheManager.increment_cache_lengths}

> increment_cache_lengths(kv_cache_inputs: [Sequence](https://docs.python.org/3/library/typing.html#typing.Sequence)[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)[[max.driver.tensor.Tensor](../../driver.md#max.driver.Tensor), ...]], prev_model_inputs: [Any](https://docs.python.org/3/library/typing.html#typing.Any)) → [List](https://docs.python.org/3/library/typing.html#typing.List)[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)[[max.driver.tensor.Tensor](../../driver.md#max.driver.Tensor), ...]]

Prepares the inputs for a multistep execution, generally by incrementing
the cache lengths. This should not require a device synchronization,
as that would defeat the purpose of multistep execution.

This should also not update the cache lengths in the manager, since this
batch is still considered in progress.
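The distinction above can be illustrated with plain Python lists standing in for the cache-length tensors (the real method operates on device tensors and avoids host synchronization; `toy_increment_cache_lengths` is a hypothetical sketch):

```python
def toy_increment_cache_lengths(
    cache_lengths: list[int], tokens_per_step: int = 1
) -> list[int]:
    # Produce fresh inputs for the next step WITHOUT mutating the
    # manager's own state -- the batch is still in progress.
    return [n + tokens_per_step for n in cache_lengths]


lengths = [10, 37, 4]                 # current cache lengths for three sequences
next_inputs = toy_increment_cache_lengths(lengths)
assert next_inputs == [11, 38, 5]     # inputs for the next decode step
assert lengths == [10, 37, 4]         # manager state untouched
```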

### `infer_optimal_batch_size()` \{#max.pipelines.kv_cache.manager.KVCacheManager.infer_optimal_batch_size}

> *abstract classmethod* infer_optimal_batch_size(params: [KVCacheParams](cache_params.md#max.pipelines.kv_cache.cache_params.KVCacheParams), max_seq_len: [int](https://docs.python.org/3/library/functions.html#int), num_layers: [int](https://docs.python.org/3/library/functions.html#int), available_cache_memory: [int](https://docs.python.org/3/library/functions.html#int), devices: [List](https://docs.python.org/3/library/typing.html#typing.List)[[Device](../../driver.md#max.driver.Device)], \*\*kwargs: [Any](https://docs.python.org/3/library/typing.html#typing.Any)) → [int](https://docs.python.org/3/library/functions.html#int)

Returns the estimated optimal batch size for the KV cache.
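A naive sketch of the relationship between the memory budget and batch size: divide the available cache memory by the per-sequence cache cost. The real classmethod presumably weighs more factors (devices, model parameters), so treat `toy_infer_optimal_batch_size` as purely illustrative:

```python
def toy_infer_optimal_batch_size(
    available_cache_memory: int, bytes_per_sequence: int
) -> int:
    # Largest number of sequences whose full-length caches fit in the budget.
    return available_cache_memory // bytes_per_sequence


# If each sequence's cache costs 512 MiB, an 8 GiB budget fits 16 sequences.
assert toy_infer_optimal_batch_size(8 * 2**30, 512 * 2**20) == 16
```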

### `input_symbols()` \{#max.pipelines.kv_cache.manager.KVCacheManager.input_symbols}

> *abstract* input_symbols() → [Sequence](https://docs.python.org/3/library/typing.html#typing.Sequence)[[KVCacheInputSymbols](#max.pipelines.kv_cache.manager.KVCacheInputSymbols)]

Returns the input symbols for the KV cache manager.

### `max_sequence_length` \{#max.pipelines.kv_cache.manager.KVCacheManager.max_sequence_length}

> *property* max_sequence_length*: [int](https://docs.python.org/3/library/functions.html#int)*

The maximum sequence length in the current cache.

### `num_kv_inputs()` \{#max.pipelines.kv_cache.manager.KVCacheManager.num_kv_inputs}

> num_kv_inputs() → [int](https://docs.python.org/3/library/functions.html#int)

Returns the default number of KV cache inputs for KV managers.

Subclasses with a different number of KV cache inputs should override
this method and `increment_cache_lengths()`.

### `release()` \{#max.pipelines.kv_cache.manager.KVCacheManager.release}

> release(seq_id: [int](https://docs.python.org/3/library/functions.html#int)) → [None](https://docs.python.org/3/library/constants.html#None)

Releases the given `seq_id`, marking this sequence as complete.
This returns the `seq_id` back to the available pool of cache memory,
allowing it to be reused when a new sequence is claimed.

### `slots_remaining` \{#max.pipelines.kv_cache.manager.KVCacheManager.slots_remaining}

> *property* slots_remaining*: [set](https://docs.python.org/3/library/stdtypes.html#set)[[int](https://docs.python.org/3/library/functions.html#int)]*

The set of cache slots currently available.

### `step()` \{#max.pipelines.kv_cache.manager.KVCacheManager.step}

> step(seq_ids_and_new_tokens: [dict](https://docs.python.org/3/library/stdtypes.html#dict)[[int](https://docs.python.org/3/library/functions.html#int), [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)]) → [None](https://docs.python.org/3/library/constants.html#None)

Updates the `cache_lengths` objects to record that a new
KV projection step has occurred and that the underlying memory
has been written to. This `cache_lengths` value is then used
downstream in `fetch()` to track which section of memory should
be used in the kernels.
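The bookkeeping described above can be sketched as follows. This is a hypothetical stand-in, not the real implementation: after the model writes new KV projections, record how many tokens each sequence now has cached:

```python
# seq_id -> number of tokens currently cached for that sequence
cache_lengths = {0: 10, 1: 37}


def toy_step(seq_ids_and_new_tokens: dict[int, list[int]]) -> None:
    # After a generation step, extend each sequence's recorded cache length
    # by the number of newly written tokens.
    for seq_id, new_tokens in seq_ids_and_new_tokens.items():
        cache_lengths[seq_id] += len(new_tokens)


toy_step({0: [101], 1: [7, 8]})      # one new token for seq 0, two for seq 1
assert cache_lengths == {0: 11, 1: 39}
```

A subsequent fetch would then consult these lengths to decide which region of cache memory each kernel should read.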
