Python module
max.nn.kv_cache
Cache configuration

| Name | Description |
|---|---|
| KVCacheBuffer | This is a collection of the KVCache buffers. |
| KVCacheParamInterface | Interface for KV cache parameters. |
| KVCacheParams | Configuration parameters for key-value cache management in transformer models. |
| KVCacheQuantizationConfig | Configuration for KVCache quantization. |
| MultiKVCacheParams | Aggregates multiple KV cache parameter sets. |

Cache inputs

| Name | Description |
|---|---|
| KVCacheInputs | Symbolic graph input types for all devices' paged KV cache. |
| KVCacheInputsPerDevice | Symbolic graph input types for a single device's paged KV cache. |
| PagedCacheValues | Alias of KVCacheInputsPerDevice[TensorValue, BufferValue]. |

Attention dispatch

| Name | Description |
|---|---|
| AttentionDispatchResolver | Resolves packed attention decode metadata via kernel custom ops. |

Metrics

| Name | Description |
|---|---|
| KVCacheMetrics | Metrics for the KV cache. |

Functions

| Name | Description |
|---|---|
| build_max_lengths_tensor | Builds a [num_steps, 2] uint32 buffer of per-step maximum lengths. |
| compute_max_seq_len_fitting_in_cache | Computes the maximum sequence length that can fit in the available memory. |
| compute_num_device_blocks | Computes the number of blocks that can be allocated based on the available cache memory. |
| compute_num_host_blocks | Computes the number of blocks that can be allocated on the host. |
| estimated_memory_size | Computes the estimated memory size of the KV cache used by all replicas. |
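
The block-counting functions above all relate a memory budget to a number of fixed-size cache blocks. The sketch below illustrates that arithmetic in plain Python. It does not call `max.nn.kv_cache`, and it does not reproduce the module's signatures: every function name, parameter, and the per-block size formula here is a hypothetical stand-in, assuming a paged cache that stores keys and values for every layer in each block.

```python
# Illustrative only: these helpers are hypothetical and are NOT the
# max.nn.kv_cache API. They sketch the kind of arithmetic a paged KV cache
# needs when converting a memory budget into blocks and tokens.


def kv_bytes_per_block(
    num_layers: int,
    n_kv_heads: int,
    head_dim: int,
    page_size: int,
    dtype_bytes: int,
) -> int:
    """Bytes needed by one block holding `page_size` tokens (assumed layout)."""
    # The factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * n_kv_heads * head_dim * page_size * dtype_bytes


def num_blocks_fitting(available_bytes: int, bytes_per_block: int) -> int:
    """Whole blocks that fit in the given memory budget."""
    if bytes_per_block <= 0:
        raise ValueError("bytes_per_block must be positive")
    return max(available_bytes, 0) // bytes_per_block


def max_seq_len_fitting(
    available_bytes: int,
    bytes_per_block: int,
    page_size: int,
    batch_size: int = 1,
) -> int:
    """Longest per-request sequence if blocks are split evenly across a batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    blocks = num_blocks_fitting(available_bytes, bytes_per_block)
    return (blocks // batch_size) * page_size


# Example figures (assumed, not taken from any particular model).
block_bytes = kv_bytes_per_block(
    num_layers=32, n_kv_heads=8, head_dim=128, page_size=128, dtype_bytes=2
)
budget = 8 * 1024**3  # 8 GiB reserved for the cache
blocks = num_blocks_fitting(budget, block_bytes)
seq_len = max_seq_len_fitting(budget, block_bytes, page_size=128, batch_size=4)
```

The real functions take the module's own parameter objects (for example `KVCacheParams`) and may account for replicas, quantization, and host versus device memory, so consult each function's reference page for its actual arguments and behavior.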