Python module

naive_cache

Naive KV cache for the Transformer.

`NaiveKVCacheInputSymbols`

class max.pipelines.kv_cache.naive_cache.NaiveKVCacheInputSymbols(k_cache: max.graph.type.BufferType, v_cache: max.graph.type.BufferType, start_pos: max.graph.type.TensorType, null_op: max.graph.type.TensorType)

`k_cache`

k_cache*: BufferType*

`null_op`

null_op*: TensorType*

`start_pos`

start_pos*: TensorType*

`v_cache`

v_cache*: BufferType*

`NaiveKVCacheManager`

class max.pipelines.kv_cache.naive_cache.NaiveKVCacheManager(params: KVCacheParams, max_batch_size: int, max_seq_len: int, num_layers: int, devices: list[max._core.driver.Device], session: InferenceSession)

`cache_shape`

property cache_shape*: list[int]*

`estimated_memory_size()`

classmethod estimated_memory_size(params: KVCacheParams, max_batch_size: int, max_seq_len: int, num_layers: int, available_cache_memory: int, devices: list[max._core.driver.Device], **kwargs: Any) → int

Returns the estimated total memory usage of the KV cache, as an integer.
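The formula behind this estimate is not documented here. As a rough illustration only, a dense KV cache is commonly sized as two buffers (keys and values) times layers, batch size, sequence length, KV heads, head dimension, and bytes per element. The sketch below assumes that formula; `ToyKVParams`, its field names, and `toy_estimated_memory_size` are hypothetical stand-ins, not part of `max.pipelines.kv_cache`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyKVParams:
    """Hypothetical stand-in for KVCacheParams; field names are assumptions."""

    n_kv_heads: int
    head_dim: int
    dtype_bytes: int  # e.g. 2 for a 16-bit float, 4 for float32


def toy_estimated_memory_size(
    params: ToyKVParams,
    max_batch_size: int,
    max_seq_len: int,
    num_layers: int,
) -> int:
    """Estimate bytes for a dense KV cache (assumed formula, not MAX's)."""
    # Factor of 2: one buffer for keys and one for values.
    return (
        2
        * num_layers
        * max_batch_size
        * max_seq_len
        * params.n_kv_heads
        * params.head_dim
        * params.dtype_bytes
    )


params = ToyKVParams(n_kv_heads=8, head_dim=128, dtype_bytes=2)
size = toy_estimated_memory_size(
    params, max_batch_size=4, max_seq_len=2048, num_layers=32
)
print(size)  # 1073741824 bytes, i.e. 1 GiB
```

The real classmethod also accepts `available_cache_memory` and `devices`; how it uses them is not stated on this page, so the sketch leaves them out.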

`fetch()`

fetch(batch: list[max.pipelines.context.context.InputContext], num_steps: int = 1) → list[max.pipelines.kv_cache.manager.KVCacheInputs]

Returns the blocks and other inputs to the KV cache kernel for the given sequence IDs and prompts.
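To make the role of these inputs concrete, here is a conceptual sketch of how a naive (contiguous, preallocated) KV cache is typically driven: the caller fetches the key and value buffers together with a start position, and new entries are written from that offset onward. `ToyNaiveCache` and its methods are hypothetical and use plain Python lists; they are not the `NaiveKVCacheManager` implementation and do not reflect its return types.

```python
class ToyNaiveCache:
    """Conceptual stand-in for a naive KV cache; not the MAX implementation."""

    def __init__(self, max_seq_len: int) -> None:
        # Preallocate fixed-size buffers, as a naive cache typically does.
        self.k_cache = [None] * max_seq_len
        self.v_cache = [None] * max_seq_len
        self.start_pos = 0  # number of positions already filled

    def fetch(self):
        # Hand back the buffers plus the offset where new entries belong.
        return self.k_cache, self.v_cache, self.start_pos

    def write(self, keys: list, values: list) -> None:
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        end = self.start_pos + len(keys)
        if end > len(self.k_cache):
            raise ValueError("write would exceed max_seq_len")
        # Equal-length slice assignment keeps the buffer size fixed.
        self.k_cache[self.start_pos:end] = keys
        self.v_cache[self.start_pos:end] = values
        self.start_pos = end


cache = ToyNaiveCache(max_seq_len=8)
cache.write(["k0", "k1", "k2"], ["v0", "v1", "v2"])  # prompt tokens
_, _, pos_after_prompt = cache.fetch()
cache.write(["k3"], ["v3"])  # one decode step
_, _, pos_after_step = cache.fetch()
print(pos_after_prompt, pos_after_step)  # 3 4
```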

`infer_optimal_batch_size()`

classmethod infer_optimal_batch_size(params: KVCacheParams, max_seq_len: int, num_layers: int, available_cache_memory: int, devices: list[max._core.driver.Device], **kwargs: Any) → int

Returns the estimated optimal batch size for the KV cache.
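How the optimal batch size is derived is not specified here. One simple heuristic, shown below purely as an assumption, is to divide the available cache memory by the bytes one full-length sequence would occupy and clamp the result to at least 1. `toy_optimal_batch_size` and `bytes_per_token_per_layer` are hypothetical names, not part of the MAX API.

```python
def toy_optimal_batch_size(
    bytes_per_token_per_layer: int,
    max_seq_len: int,
    num_layers: int,
    available_cache_memory: int,
) -> int:
    """Largest batch whose full-length sequences fit in memory (assumed heuristic)."""
    bytes_per_sequence = bytes_per_token_per_layer * max_seq_len * num_layers
    # Clamp to 1 so a caller always gets a usable batch size (an assumption).
    return max(1, available_cache_memory // bytes_per_sequence)


# Keys + values, 8 KV heads, head_dim 128, 2 bytes per element (illustrative).
per_token = 2 * 8 * 128 * 2

batch = toy_optimal_batch_size(
    per_token,
    max_seq_len=2048,
    num_layers=32,
    available_cache_memory=3 * 1024**3,  # 3 GiB
)
print(batch)  # 12
```

Whether the real classmethod clamps, rounds, or applies an upper bound is not documented on this page.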

`input_symbols()`

input_symbols() → list[max.pipelines.kv_cache.naive_cache.NaiveKVCacheInputSymbols]

Returns the input symbols for the KV cache manager.
