# Python module: `max.nn.kv_cache`
## Cache configuration

| Class | Description |
|---|---|
| `KVCacheBuffer` | A collection of KV cache buffers. |
| `KVCacheParamInterface` | Interface for KV cache parameters. |
| `KVCacheParams` | Configuration parameters for key-value cache management in transformer models. |
| `KVCacheQuantizationConfig` | Configuration for KV cache quantization. |
| `MultiKVCacheParams` | Aggregates multiple KV cache parameter sets. |
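The parameters above determine how much memory each cache block consumes. As an illustrative sketch only (the field names below mirror typical KV cache parameters but are not the actual `KVCacheParams` API), the per-block footprint of a paged cache follows from the layer count, head geometry, page size, and dtype width:

```python
from dataclasses import dataclass

# Hypothetical stand-in for KV cache parameters; field names are
# illustrative assumptions, not the real KVCacheParams fields.
@dataclass
class CacheParams:
    num_layers: int
    n_kv_heads: int
    head_dim: int
    page_size: int    # tokens per cache block
    dtype_bytes: int  # e.g. 2 for bfloat16

def bytes_per_block(p: CacheParams) -> int:
    # Each block stores a key and a value tensor (factor of 2) for
    # every layer, KV head, head dimension, and token slot in the page.
    return 2 * p.num_layers * p.n_kv_heads * p.head_dim * p.page_size * p.dtype_bytes

params = CacheParams(num_layers=32, n_kv_heads=8, head_dim=128,
                     page_size=128, dtype_bytes=2)
print(bytes_per_block(params))  # 16_777_216 bytes (16 MiB) per block
```

This kind of arithmetic is what makes the block-count helpers in the Functions section below well-defined: a memory budget divides into a whole number of fixed-size blocks.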
## Cache inputs

| Class | Description |
|---|---|
| `AttentionDispatchMetadata` | Wraps the scalar attention dispatch metadata tensor for a single device. |
| `KVCacheInputs` | A sequence of `KVCacheInputsPerDevice`. |
| `KVCacheInputsPerDevice` | Holds the concrete KV cache buffer inputs for a single device. |
| `NestedIterableDataclass` | Base class for input symbols for KV cache managers. |
| `PagedCacheValues` | Concrete graph values for a single device's paged KV cache. |
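The nesting these classes describe can be pictured with a small sketch. The field names below are placeholders for the graph tensors the real types wrap, so treat this as an assumption about shape, not the actual API:

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical per-device input bundle; the real KVCacheInputsPerDevice
# wraps concrete graph buffers, so these fields are illustrative.
@dataclass
class PerDeviceInputs:
    kv_blocks: object      # the device's paged KV buffer
    cache_lengths: object  # per-sequence cached lengths
    lookup_table: object   # logical-to-physical block mapping

# "KVCacheInputs is a sequence of KVCacheInputsPerDevice":
# one entry per device participating in the cache.
inputs: Sequence[PerDeviceInputs] = [
    PerDeviceInputs(kv_blocks=..., cache_lengths=..., lookup_table=...),  # device 0
    PerDeviceInputs(kv_blocks=..., cache_lengths=..., lookup_table=...),  # device 1
]
print(len(inputs))  # 2 devices
```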
## Attention dispatch

| Class | Description |
|---|---|
| `AttentionDispatchResolver` | Resolves packed attention decode metadata via kernel custom ops. |
## Metrics

| Class | Description |
|---|---|
| `KVCacheMetrics` | Metrics for the KV cache. |
## Functions

| Function | Description |
|---|---|
| `attention_dispatch_metadata` | Extracts the `AttentionDispatchMetadata` from a KV collection. |
| `attention_dispatch_metadata_list` | Extracts `AttentionDispatchMetadata` from each KV collection. |
| `build_max_lengths_tensor` | Builds a `[num_steps, 2]` uint32 buffer of per-step maximum lengths. |
| `compute_max_seq_len_fitting_in_cache` | Computes the maximum sequence length that fits in the available cache memory. |
| `compute_num_device_blocks` | Computes the number of blocks that can be allocated from the available cache memory. |
| `compute_num_host_blocks` | Computes the number of blocks that can be allocated on the host. |
| `estimated_memory_size` | Computes the estimated memory size of the KV cache across all replicas. |
| `unflatten_ragged_attention_inputs` | Unmarshals flattened KV graph inputs into typed cache values. |
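The block- and length-computation helpers above reduce to simple integer arithmetic over a memory budget. The sketch below shows that arithmetic under stated assumptions; the real functions take richer arguments (devices, cache parameters, settings), so these names and signatures are illustrative, not the actual API:

```python
# Illustrative sketch of the sizing arithmetic behind the helpers above.
# All names and signatures here are assumptions for exposition.

def num_device_blocks(available_cache_bytes: int, bytes_per_block: int) -> int:
    # Number of whole fixed-size KV cache blocks that fit in the budget.
    return available_cache_bytes // bytes_per_block

def max_seq_len_fitting_in_cache(num_blocks: int, page_size: int) -> int:
    # If a single sequence may use the entire cache, its capacity in
    # tokens is blocks * tokens-per-block.
    return num_blocks * page_size

blocks = num_device_blocks(available_cache_bytes=8 * 1024**3,  # 8 GiB budget
                           bytes_per_block=16_777_216)         # 16 MiB blocks
print(blocks)                                      # 512 blocks
print(max_seq_len_fitting_in_cache(blocks, 128))   # 65_536 tokens
```

The host-side variant follows the same division, just against host rather than device memory.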