Python class
KVCacheBuffer
class max.nn.kv_cache.KVCacheBuffer(total_num_pages, values, scales=None)
Bases: object
A collection of KV cache buffers.
Two types of buffers are currently supported: values and scales. The scales are optional and are used for FP8 quantization.
The length of the list of buffers corresponds to the tensor parallel (TP) degree: each buffer in the list corresponds to a single TP shard.
For data parallelism (DP), there would be multiple instances of KVCacheBuffer per replica.
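The layout described above can be pictured with a small stand-in. This is a minimal sketch, not the real class: SketchKVCacheBuffer and its tp_degree helper are hypothetical names, and plain strings stand in for the device buffers that the real KVCacheBuffer holds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchKVCacheBuffer:
    """Hypothetical stand-in; NOT max.nn.kv_cache.KVCacheBuffer."""

    total_num_pages: int
    values: List[str]                    # one entry per TP shard
    scales: Optional[List[str]] = None   # only present for FP8 quantization

    @property
    def tp_degree(self) -> int:
        # Hypothetical helper: the TP degree is the length of the list.
        return len(self.values)


# TP degree 2 without quantization: two value buffers, no scales.
plain_cache = SketchKVCacheBuffer(
    total_num_pages=128,
    values=["values@shard0", "values@shard1"],
)

# TP degree 2 with FP8 quantization: one scale buffer per value buffer.
fp8_cache = SketchKVCacheBuffer(
    total_num_pages=128,
    values=["values@shard0", "values@shard1"],
    scales=["scales@shard0", "scales@shard1"],
)
```

The only point of the sketch is the shape of the data: the list index identifies the TP shard, and scales is None unless quantization is in use.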
all_buffers
Returns all value and scale buffers in a single flat list.
Returns:
A list containing every value buffer followed by every scale buffer (if scales are present).
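The documented ordering can be illustrated in plain Python. This is a sketch only: flatten_buffers is a hypothetical helper that mirrors the description above, not the library's implementation, and strings stand in for the actual buffers.

```python
from typing import List, Optional


def flatten_buffers(
    values: List[str], scales: Optional[List[str]] = None
) -> List[str]:
    """Hypothetical helper mirroring the documented order of all_buffers."""
    flat = list(values)          # every value buffer first
    if scales is not None:
        flat.extend(scales)      # then every scale buffer, if present
    return flat


print(flatten_buffers(["v0", "v1"]))                # ['v0', 'v1']
print(flatten_buffers(["v0", "v1"], ["s0", "s1"]))  # ['v0', 'v1', 's0', 's1']
```

With a TP degree of 2 and scales present, the flat list therefore has four entries, values before scales.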
allocate_host_offload_buffer()
allocate_host_offload_buffer(total_num_host_pages)
Allocates a KVCacheBuffer for host offloading.
The allocated buffer has the same characteristics as the original buffer, apart from total_num_pages and its location. The host offload buffer is allocated on DevicePinnedBuffer for fast transfer speeds.
Parameters:
total_num_host_pages (int)
Return type:
scales
total_num_pages
total_num_pages: int
values