Python class
KVCacheBuffer
class max.nn.kv_cache.KVCacheBuffer(replicates_kv_across_tp, values, scales=None)
Bases: object
A collection of KVCache buffers for one data-parallel replica.
Two buffer kinds are supported: values and (optionally, for FP8
quantization) scales. The length of each list corresponds to the
tensor-parallel degree, with one buffer per TP shard.
page_size and replicates_kv_across_tp describe the physical layout,
so KV connectors can offload this cache without a separate
KVCacheParams reference. The replicates_kv_across_tp flag is True when
the KV data is replicated identically across TP shards (MLA) and False
when it is sharded (MHA).
all_buffers
Returns all value and scale buffers in a single flat list.
Returns:
A list containing every value buffer followed by every scale buffer (if scales are present).
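The ordering described above can be illustrated with a minimal sketch. `BufferSetSketch` is a hypothetical stand-in written for this example, not the MAX class, and plain strings stand in for device buffers; only the field names and the "values first, then scales" ordering come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferSetSketch:
    """Hypothetical stand-in for KVCacheBuffer; not the MAX class itself."""

    replicates_kv_across_tp: bool
    values: list                   # one entry per TP shard
    scales: Optional[list] = None  # present only with FP8 quantization

    @property
    def all_buffers(self) -> list:
        # Value buffers first, then scale buffers when they exist.
        return list(self.values) + list(self.scales or [])


# Strings stand in for device buffers; the TP degree is 2 here.
fp8 = BufferSetSketch(False, ["v0", "v1"], ["s0", "s1"])
bf16 = BufferSetSketch(False, ["v0", "v1"])
print(fp8.all_buffers)   # ['v0', 'v1', 's0', 's1']
print(bf16.all_buffers)  # ['v0', 'v1']
```

When scales is None, the flat list is simply the value buffers.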
replicates_kv_across_tp
replicates_kv_across_tp: bool
scales
to_memory()
Convert to a flat list of offload-ready memory units.
Each unit covers one buffer kind (values or scales) and one
logical TP group. Non-replicated shards become individual
KVCacheMemory entries; replicated shards become one
ReplicatedKVCacheMemory entry (root + peers).
Every buffer is re-viewed as a 2-D [num_pages, bytes_per_page]
uint8 array so the offload engine can treat all caches
uniformly regardless of original dtype or shape.
Returns:
A list of memory units ready for use by KV connectors and the offload engine.
Return type:
list[KVCacheMemory]
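The grouping and re-viewing rules described for to_memory() can be sketched with the standard library alone. Everything named here (`MemoryUnitSketch`, `as_pages`, `to_memory_sketch`) is hypothetical and exists only for illustration. The sketch uses `array` and `memoryview.cast` in place of real device buffers, and it assumes that the first shard of a replicated group acts as the root.

```python
from array import array
from dataclasses import dataclass, field


@dataclass
class MemoryUnitSketch:
    """Hypothetical stand-in for KVCacheMemory / ReplicatedKVCacheMemory."""

    kind: str         # "values" or "scales"
    root: memoryview  # 2-D byte view: [num_pages, bytes_per_page]
    peers: list = field(default_factory=list)  # non-empty only when replicated


def as_pages(buf: array, num_pages: int) -> memoryview:
    """Re-view a contiguous buffer as [num_pages, bytes_per_page] bytes."""
    nbytes = len(buf) * buf.itemsize
    if nbytes % num_pages:
        raise ValueError("buffer size is not a multiple of num_pages")
    return memoryview(buf).cast("B", shape=[num_pages, nbytes // num_pages])


def to_memory_sketch(kind, shards, num_pages, replicated):
    views = [as_pages(s, num_pages) for s in shards]
    if replicated:
        # One unit for the whole TP group. Assumption: shard 0 is the root.
        return [MemoryUnitSketch(kind, views[0], views[1:])]
    # Sharded layout: one unit per TP shard.
    return [MemoryUnitSketch(kind, v) for v in views]


# Two TP shards, each holding 4 pages of 8 floats.
shards = [array("f", range(32)) for _ in range(2)]
sharded = to_memory_sketch("values", shards, num_pages=4, replicated=False)
grouped = to_memory_sketch("values", shards, num_pages=4, replicated=True)
print(len(sharded), sharded[0].root.shape)  # 2 units; (4, 32) with 4-byte floats
print(len(grouped), len(grouped[0].peers))  # 1 unit with 1 peer
```

Because every view ends up as unsigned bytes with a fixed page stride, code that copies pages does not need to know the original dtype or shape.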
total_num_pages
property total_num_pages: int
Returns the total number of pages across all values and scales.
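One plausible reading of this property, sketched under assumptions: each buffer's leading dimension is its page count, and the total is the sum over every value and scale buffer. `paged_view` and `total_num_pages_sketch` are hypothetical helpers, not part of MAX, and the real property may count pages differently.

```python
def paged_view(num_pages: int, bytes_per_page: int) -> memoryview:
    """Zero-filled stand-in for one shard's buffer, viewed as 2-D bytes."""
    raw = bytearray(num_pages * bytes_per_page)
    return memoryview(raw).cast("B", shape=[num_pages, bytes_per_page])


def total_num_pages_sketch(values, scales=None) -> int:
    # Assumption: the leading dimension of each buffer is its page count.
    return sum(buf.shape[0] for buf in [*values, *(scales or [])])


values = [paged_view(4, 32) for _ in range(2)]  # 2 TP shards
scales = [paged_view(4, 8) for _ in range(2)]
print(total_num_pages_sketch(values))          # 8
print(total_num_pages_sketch(values, scales))  # 16
```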
values