Python module

max.nn.kv_cache

Cache configuration

KVCacheBuffer: A collection of the KVCache buffers.
KVCacheParamInterface: Interface for KV cache parameters.
KVCacheParams: Configuration parameters for key-value cache management in transformer models.
KVCacheQuantizationConfig: Configuration for KVCache quantization.
MultiKVCacheParams: Aggregates multiple KV cache parameter sets.

Cache inputs

AttentionDispatchMetadata: Wraps the scalar attention dispatch metadata tensor for a single device.
KVCacheInputs: A sequence of KVCacheInputsPerDevice.
KVCacheInputsPerDevice: Holds the concrete KV cache buffer inputs for a single device.
NestedIterableDataclass: Base class for input symbols for KV cache managers.
PagedCacheValues: Concrete graph values for a single device's paged KV cache.

Attention dispatch

AttentionDispatchResolver: Resolves packed attention decode metadata via kernel custom ops.

Metrics

KVCacheMetrics: Metrics for the KV cache.

Functions

attention_dispatch_metadata: Extracts the AttentionDispatchMetadata from a KV collection.
attention_dispatch_metadata_list: Extracts AttentionDispatchMetadata from each KV collection.
build_max_lengths_tensor: Builds a [num_steps, 2] uint32 buffer of per-step maximum lengths.
compute_max_seq_len_fitting_in_cache: Computes the maximum sequence length that can fit in the available memory.
compute_num_device_blocks: Computes the number of blocks that can be allocated based on the available cache memory.
compute_num_host_blocks: Computes the number of blocks that can be allocated on the host.
estimated_memory_size: Computes the estimated memory size of the KV cache used by all replicas.
unflatten_ragged_attention_inputs: Unmarshals flattened KV graph inputs into typed cache values.
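
Helpers like compute_num_device_blocks and compute_num_host_blocks size a paged KV cache by dividing available memory by the footprint of one block. The sketch below illustrates that arithmetic generically; it is not the MAX implementation, and every name and parameter in it is hypothetical.

```python
def estimate_num_blocks(
    available_bytes: int,
    page_size_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int,
) -> int:
    """Hypothetical sketch: how many fixed-size KV cache blocks fit in memory.

    Each cached token stores a key and a value vector (hence the factor of 2)
    per layer and per KV head; a block holds `page_size_tokens` tokens.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    bytes_per_block = page_size_tokens * bytes_per_token
    return available_bytes // bytes_per_block


# Example: 16 GiB of cache memory, 128-token pages, a 32-layer model with
# 8 KV heads of dimension 128 stored in float16 (2 bytes per element).
blocks = estimate_num_blocks(16 * 2**30, 128, 32, 8, 128, 2)
print(blocks)  # 1024 blocks of 16 MiB each
```

The same per-token footprint also bounds the longest sequence the cache can hold, which is the quantity a helper like compute_max_seq_len_fitting_in_cache reports.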