Python class

KVCacheInputsPerDevice

`KVCacheInputsPerDevice`

class max.nn.kv_cache.KVCacheInputsPerDevice(kv_blocks, cache_lengths, lookup_table, max_prompt_length, max_cache_length, kv_scales=None, attention_dispatch_metadata=None, draft_attention_dispatch_metadata=None, mla_num_partitions=None, draft_mla_num_partitions=None, kv_blocks_per_layer=None)

Bases: Generic[_Tensor, _Buffer]

Symbolic graph input types for a single device’s paged KV cache.

Parameters:

kv_blocks (_Buffer)
cache_lengths (_Tensor)
lookup_table (_Tensor)
max_prompt_length (_Tensor)
max_cache_length (_Tensor)
kv_scales (_Buffer | None)
attention_dispatch_metadata (_Tensor | None)
draft_attention_dispatch_metadata (_Tensor | None)
mla_num_partitions (_Tensor | None)
draft_mla_num_partitions (_Tensor | None)
kv_blocks_per_layer (list[_Buffer] | None)
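The class is generic over two type parameters, so one field layout can presumably hold either symbolic graph types or the values bound to them. The sketch below illustrates that idea only. `ToyPerDevice`, its two fields, and the placeholder strings are hypothetical stand-ins, not part of the MAX API.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")  # stands in for the tensor-like slot (_Tensor)
B = TypeVar("B")  # stands in for the buffer-like slot (_Buffer)


@dataclass
class ToyPerDevice(Generic[T, B]):
    """Hypothetical two-field container; not the MAX class."""

    kv_blocks: B
    cache_lengths: T


# Same layout, two parameterizations: placeholder "types" and toy "values".
symbolic: ToyPerDevice[str, str] = ToyPerDevice("buffer_type", "tensor_type")
runtime: ToyPerDevice[list, bytearray] = ToyPerDevice(bytearray(8), [0, 3])
```

Python does not enforce the type parameters at runtime. They are there for static type checkers and for readers.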

`attention_dispatch_metadata`

attention_dispatch_metadata: _Tensor | None = None

`cache_lengths`

cache_lengths: _Tensor

`draft_attention_dispatch_metadata`

draft_attention_dispatch_metadata: _Tensor | None = None

`draft_mla_num_partitions`

draft_mla_num_partitions: _Tensor | None = None

`flatten()`

flatten()

Serialize fields into a flat list for graph input binding.

Return type: list[_Tensor | _Buffer]

`flatten_without_attention_dispatch_metadata()`

flatten_without_attention_dispatch_metadata()

Return type: list[_Tensor | _Buffer]

`kv_blocks`

kv_blocks: _Buffer

`kv_blocks_per_layer`

kv_blocks_per_layer: list[_Buffer] | None = None

`kv_scales`

kv_scales: _Buffer | None = None

`lookup_table`

lookup_table: _Tensor

`max_cache_length`

max_cache_length: _Tensor

`max_prompt_length`

max_prompt_length: _Tensor

`mla_num_partitions`

mla_num_partitions: _Tensor | None = None

`unflatten()`

unflatten(it)

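Reconstruct an instance from a flat iterator over the elements produced by flatten().

Calls next(it) once per element, in the same order that flatten() emits them. The two methods must stay in lock-step: any change to the order or number of elements in one must be mirrored in the other.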

Parameters: it (Iterator[Any])
Return type: KVCacheInputsPerDevice[TensorValue, BufferValue]
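The lock-step contract between flatten() and unflatten() can be illustrated with a toy round trip. This is a minimal sketch, not the MAX implementation. `ToyCacheInputs`, its three fields, the string placeholders, and the rule that unset optional fields are skipped are all assumptions made for illustration. The sketch also assumes that unflatten() is called on an existing instance, which acts as a template for which optional fields to expect.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class ToyCacheInputs:
    """Hypothetical stand-in for illustration; not the MAX class."""

    kv_blocks: Any
    cache_lengths: Any
    kv_scales: Any | None = None  # optional field, skipped when unset

    def flatten(self) -> list[Any]:
        # Emit required fields in a fixed order, then optional ones if set.
        flat = [self.kv_blocks, self.cache_lengths]
        if self.kv_scales is not None:
            flat.append(self.kv_scales)
        return flat

    def unflatten(self, it: Iterator[Any]) -> ToyCacheInputs:
        # Assumption: `self` is a template that says which optional fields
        # are present, so exactly as many elements are consumed as
        # flatten() would emit.
        kv_blocks = next(it)
        cache_lengths = next(it)
        kv_scales = next(it) if self.kv_scales is not None else None
        return ToyCacheInputs(kv_blocks, cache_lengths, kv_scales)


# Two per-device templates; strings stand in for symbolic types.
templates = [
    ToyCacheInputs("blocks_type", "lengths_type", "scales_type"),
    ToyCacheInputs("blocks_type", "lengths_type"),
]

# Bound inputs arrive as one flat sequence covering both devices.
flat_values = ["b0", "l0", "s0", "b1", "l1"]
it = iter(flat_values)
rebuilt = [template.unflatten(it) for template in templates]
```

Taking an iterator rather than a list lets several objects consume consecutive slices of one flat input sequence without any index bookkeeping. Each call simply continues where the previous one stopped.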
