
Python module

max.nn.kv_cache

Cache configuration

KVCacheBuffer: This is a collection of the KVCache buffers.
KVCacheParamInterface: Interface for KV cache parameters.
KVCacheParams: Configuration parameters for key-value cache management in transformer models.
KVCacheQuantizationConfig: Configuration for KVCache quantization.
MultiKVCacheParams: Aggregates multiple KV cache parameter sets.

Cache inputs

KVCacheInputs: Symbolic graph input types for all devices' paged KV cache.
KVCacheInputsPerDevice: Symbolic graph input types for a single device's paged KV cache.
PagedCacheValues: alias of KVCacheInputsPerDevice[TensorValue, BufferValue]

Attention dispatch

AttentionDispatchResolver: Resolves packed attention decode metadata via kernel custom ops.

Metrics

KVCacheMetrics: Metrics for the KV cache.

Functions

build_max_lengths_tensor: Builds a [num_steps, 2] uint32 buffer of per-step maximum lengths.
compute_max_seq_len_fitting_in_cache: Computes the maximum sequence length that can fit in the available memory.
compute_num_device_blocks: Computes the number of blocks that can be allocated based on the available cache memory.
compute_num_host_blocks: Computes the number of blocks that can be allocated on the host.
estimated_memory_size: Computes the estimated memory size of the KV cache used by all replicas.
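
Example: block-count arithmetic (illustrative sketch)

The sizing functions listed above relate available memory to a number of cache blocks and a maximum sequence length. The sketch below shows the general arithmetic commonly used for a paged KV cache. It does not call the max.nn.kv_cache API and may not match its exact formulas. ToyCacheShape, bytes_per_block, num_blocks_fitting, and max_seq_len_fitting are hypothetical names introduced only for illustration, and the sketch assumes that one block holds page_size tokens of both keys and values for every layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCacheShape:
    """Hypothetical stand-in for KV cache parameters (not the MAX KVCacheParams class)."""

    num_layers: int   # transformer layers that keep a KV cache
    n_kv_heads: int   # key/value heads per layer
    head_dim: int     # elements per head
    page_size: int    # tokens stored in one block
    dtype_bytes: int  # bytes per element, e.g. 2 for bfloat16


def bytes_per_block(shape: ToyCacheShape) -> int:
    # The factor 2 accounts for storing both keys and values.
    return (
        2
        * shape.num_layers
        * shape.n_kv_heads
        * shape.head_dim
        * shape.page_size
        * shape.dtype_bytes
    )


def num_blocks_fitting(available_bytes: int, shape: ToyCacheShape) -> int:
    # Only whole blocks can be allocated, so round down.
    return available_bytes // bytes_per_block(shape)


def max_seq_len_fitting(available_bytes: int, shape: ToyCacheShape) -> int:
    # Upper bound for a single sequence that owns every block.
    return num_blocks_fitting(available_bytes, shape) * shape.page_size


if __name__ == "__main__":
    shape = ToyCacheShape(
        num_layers=32, n_kv_heads=8, head_dim=128, page_size=128, dtype_bytes=2
    )
    budget = 8 * 1024**3  # 8 GiB reserved for the cache (assumed value)
    print(bytes_per_block(shape))              # 16777216 bytes (16 MiB)
    print(num_blocks_fitting(budget, shape))   # 512
    print(max_seq_len_fitting(budget, shape))  # 65536
```

With these assumed values each block occupies 16 MiB, so an 8 GiB budget yields 512 blocks and an upper bound of 65,536 cached tokens. The library functions may account for further factors (replicas, quantization, host versus device memory) that this sketch omits.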