Python class

KVCacheInputsPerDevice

class max.nn.kv_cache.KVCacheInputsPerDevice(kv_blocks, cache_lengths, lookup_table, max_lengths, kv_scales=None, attention_dispatch_metadata=None, draft_attention_dispatch_metadata=None, mla_num_partitions=None, draft_mla_num_partitions=None)

Bases: Generic[_Tensor, _Buffer]

Symbolic graph input types for a single device’s paged KV cache.

Parameters:

  • kv_blocks (_Buffer)
  • cache_lengths (_Tensor)
  • lookup_table (_Tensor)
  • max_lengths (_Tensor)
  • kv_scales (_Buffer | None)
  • attention_dispatch_metadata (_Tensor | None)
  • draft_attention_dispatch_metadata (_Tensor | None)
  • mla_num_partitions (_Tensor | None)
  • draft_mla_num_partitions (_Tensor | None)

attention_dispatch_metadata

attention_dispatch_metadata: _Tensor | None = None

cache_lengths

cache_lengths: _Tensor

draft_attention_dispatch_metadata

draft_attention_dispatch_metadata: _Tensor | None = None

draft_mla_num_partitions

draft_mla_num_partitions: _Tensor | None = None

flatten()

flatten()

Serialize fields into a flat list for graph input binding.

Ordering: [kv_blocks, cache_lengths, lookup_table, max_lengths, kv_scales?, attention_dispatch_metadata?, draft_attention_dispatch_metadata?, mla_num_partitions?, draft_mla_num_partitions?]. Fields marked ? emit zero elements when None; unflatten must consume next(it) in this exact order.

Return type:

list[_Tensor | _Buffer]
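
The ordering rule above can be sketched with a small stand-in class. This is only an illustration under stated assumptions: `SketchInputs` is a hypothetical dataclass that mirrors the documented field names, it is not the MAX implementation, and plain strings stand in for tensor and buffer values.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchInputs:
    """Hypothetical stand-in mirroring the documented fields; not the MAX class."""

    kv_blocks: Any
    cache_lengths: Any
    lookup_table: Any
    max_lengths: Any
    kv_scales: Optional[Any] = None
    attention_dispatch_metadata: Optional[Any] = None
    draft_attention_dispatch_metadata: Optional[Any] = None
    mla_num_partitions: Optional[Any] = None
    draft_mla_num_partitions: Optional[Any] = None

    def flatten(self) -> list:
        # Required fields always appear, in the documented order.
        flat = [self.kv_blocks, self.cache_lengths, self.lookup_table, self.max_lengths]
        # Optional fields contribute zero elements when they are None.
        for value in (
            self.kv_scales,
            self.attention_dispatch_metadata,
            self.draft_attention_dispatch_metadata,
            self.mla_num_partitions,
            self.draft_mla_num_partitions,
        ):
            if value is not None:
                flat.append(value)
        return flat


# Only the four required fields are set, so the flat list has four entries.
minimal = SketchInputs("blocks", "lengths", "table", "max")
flat_minimal = minimal.flatten()

# Setting kv_scales appends one more entry right after the required fields.
with_scales = SketchInputs("blocks", "lengths", "table", "max", kv_scales="scales")
flat_with_scales = with_scales.flatten()
```

Because absent optional fields leave no placeholder, the length of the flat list varies with the configuration, which is why the consumer has to know which optional fields were set.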

flatten_without_attention_dispatch_metadata()

flatten_without_attention_dispatch_metadata()

Return type:

list[_Tensor | _Buffer]

kv_blocks

kv_blocks: _Buffer

kv_scales

kv_scales: _Buffer | None = None

lookup_table

lookup_table: _Tensor

max_lengths

max_lengths: _Tensor

mla_num_partitions

mla_num_partitions: _Tensor | None = None

unflatten()

unflatten(it)

Reconstruct from a flat iterator produced by flatten.

Consumes next(it) in the same order flatten emits elements; the two methods must stay in lock-step.

Parameters:

  • it (Iterator[Any])

Return type:

KVCacheInputsPerDevice[TensorValue, BufferValue]
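
The lock-step consumption described above might look like the following sketch. It rests on assumptions that the page does not confirm: `SketchInputs` and `OPTIONAL_FIELDS` are hypothetical names, `unflatten` is modeled as an instance method, and the instance it is called on is assumed to act as a template that decides which optional slots get consumed.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional

# Optional slots, in the documented flatten order.
OPTIONAL_FIELDS = (
    "kv_scales",
    "attention_dispatch_metadata",
    "draft_attention_dispatch_metadata",
    "mla_num_partitions",
    "draft_mla_num_partitions",
)


@dataclass
class SketchInputs:
    """Hypothetical stand-in for illustration; not the MAX class."""

    kv_blocks: Any
    cache_lengths: Any
    lookup_table: Any
    max_lengths: Any
    kv_scales: Optional[Any] = None
    attention_dispatch_metadata: Optional[Any] = None
    draft_attention_dispatch_metadata: Optional[Any] = None
    mla_num_partitions: Optional[Any] = None
    draft_mla_num_partitions: Optional[Any] = None

    def unflatten(self, it: Iterator[Any]) -> "SketchInputs":
        # The four required fields are always consumed first.
        required = [next(it) for _ in range(4)]
        # Assumption: an optional slot is consumed only when this template
        # instance has that field set, mirroring what flatten would emit.
        optional = {
            name: next(it)
            for name in OPTIONAL_FIELDS
            if getattr(self, name) is not None
        }
        return SketchInputs(*required, **optional)


# Template describing which optional fields are present for one device.
template = SketchInputs("kv_t", "len_t", "table_t", "max_t", mla_num_partitions="mla_t")

# A shared iterator over flat values; "next_device" belongs to whatever follows.
values = iter(["blocks", "lengths", "table", "max", "partitions", "next_device"])
rebuilt = template.unflatten(values)
leftover = next(values)
```

Taking an iterator rather than a list lets one flat sequence be consumed piece by piece: each call advances it by exactly the number of elements its fields require and leaves the remainder for the next consumer.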