Python module

max.nn.kv_cache

Cache configuration

KVCacheBuffer: A collection of the KVCache buffers.
KVCacheParamInterface: Interface for KV cache parameters.
KVCacheParams: Configuration parameters for key-value cache management in transformer models.
KVCacheQuantizationConfig: Configuration for KVCache quantization.
MultiKVCacheParams: Aggregates multiple KV cache parameter sets.

Cache inputs

KVCacheInputs: Symbolic graph input types for all devices' paged KV cache.
KVCacheInputsPerDevice: Symbolic graph input types for a single device's paged KV cache.
PagedCacheValues: Alias of KVCacheInputsPerDevice[TensorValue, BufferValue].

Attention dispatch

AttentionDispatchResolver: Resolves packed attention decode metadata via kernel custom ops.

Metrics

KVCacheMetrics: Metrics for the KV cache.

Functions

build_max_lengths_tensor: Builds a [num_steps, 2] uint32 buffer of per-step maximum lengths.
compute_max_seq_len_fitting_in_cache: Computes the maximum sequence length that can fit in the available memory.
compute_num_device_blocks: Computes the number of blocks that can be allocated based on the available cache memory.
compute_num_host_blocks: Computes the number of blocks that can be allocated on the host.
estimated_memory_size: Computes the estimated memory size of the KV cache used by all replicas.
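
The sizing functions listed above all relate a memory budget to a count of cache blocks. The sketch below illustrates that arithmetic for a generic paged KV cache. It is not the max.nn.kv_cache implementation: every function name and parameter here is hypothetical, and the layout (one key tensor and one value tensor per layer, fixed page size) is an assumption rather than something this page states.

```python
# Illustrative only: generic paged KV cache sizing arithmetic.
# None of these names or signatures come from max.nn.kv_cache.


def bytes_per_block(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    page_size: int,
    dtype_bytes: int,
) -> int:
    """Bytes needed to hold one block (page) of tokens across all layers."""
    # The factor of 2 covers storing both keys and values (assumed layout).
    return 2 * num_layers * num_kv_heads * head_dim * page_size * dtype_bytes


def num_blocks_fitting(available_bytes: int, block_bytes: int) -> int:
    """Whole blocks that fit in the given memory budget."""
    return available_bytes // block_bytes


def max_seq_len_fitting(num_blocks: int, page_size: int) -> int:
    """Longest single sequence that the given number of blocks can hold."""
    return num_blocks * page_size


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
    block = bytes_per_block(
        num_layers=32, num_kv_heads=8, head_dim=128, page_size=128, dtype_bytes=2
    )
    blocks = num_blocks_fitting(available_bytes=8 * 1024**3, block_bytes=block)
    print(block, blocks, max_seq_len_fitting(blocks, page_size=128))
```

Integer division matters in this sketch: a partially filled block cannot be allocated, so any remainder of the budget is left unused.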