Python module

max.nn.kv_cache

Cache configuration

KVCacheBuffer: A collection of the KVCache buffers.
KVCacheParamInterface: Interface for KV cache parameters.
KVCacheParams: Configuration parameters for key-value cache management in transformer models.
KVCacheQuantizationConfig: Configuration for KVCache quantization.
MultiKVCacheParams: Aggregates multiple KV cache parameter sets.

Cache inputs

AttentionDispatchMetadata: Wraps the scalar attention dispatch metadata tensor for a single device.
KVCacheInputs: A sequence of KVCacheInputsPerDevice.
KVCacheInputsPerDevice: Holds the concrete KV cache buffer inputs for a single device.
NestedIterableDataclass: Base class for input symbols for KV cache managers.
PagedCacheValues: Concrete graph values for a single device's paged KV cache.

Attention dispatch

AttentionDispatchResolver: Resolves packed attention decode metadata via kernel custom ops.

Metrics

KVCacheMetrics: Metrics for the KV cache.

Functions

attention_dispatch_metadata: Extracts the AttentionDispatchMetadata from a KV collection.
attention_dispatch_metadata_list: Extracts AttentionDispatchMetadata from each KV collection.
build_max_lengths_tensor: Builds a [num_steps, 2] uint32 buffer of per-step maximum lengths.
compute_max_seq_len_fitting_in_cache: Computes the maximum sequence length that can fit in the available memory.
compute_num_device_blocks: Computes the number of blocks that can be allocated based on the available cache memory.
compute_num_host_blocks: Computes the number of blocks that can be allocated on the host.
estimated_memory_size: Computes the estimated memory size of the KV cache used by all replicas.
unflatten_ragged_attention_inputs: Unmarshals flattened KV graph inputs into typed cache values.
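
Helpers like compute_num_device_blocks and compute_num_host_blocks size a paged KV cache by dividing available memory by the footprint of one block. The sketch below illustrates that arithmetic generically; it is not the MAX implementation, and every name and parameter in it is hypothetical.

```python
def estimate_num_blocks(
    available_bytes: int,
    page_size_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int,
) -> int:
    """Hypothetical sketch: how many fixed-size KV cache blocks fit in memory.

    Each cached token stores a key and a value vector (hence the factor of 2)
    per layer and per KV head; a block holds `page_size_tokens` tokens.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    bytes_per_block = page_size_tokens * bytes_per_token
    return available_bytes // bytes_per_block


# Example: 16 GiB of cache memory, 128-token pages, a 32-layer model with
# 8 KV heads of dimension 128 stored in float16 (2 bytes per element).
blocks = estimate_num_blocks(16 * 2**30, 128, 32, 8, 128, 2)
print(blocks)  # 1024 blocks of 16 MiB each
```

The same per-token footprint also bounds the longest sequence the cache can hold, which is the quantity a helper like compute_max_seq_len_fitting_in_cache reports.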