Python class
KVCacheBuffer
class max.nn.kv_cache.KVCacheBuffer(total_num_pages, values, scales=None)
Bases: object
This is a collection of the KVCache buffers.
There are two types of supported buffers today: values and scales. The scales are optional and used for FP8 quantization.
The length of each list of buffers corresponds to the tensor parallel (TP) degree: each buffer in the list corresponds to a single TP shard.
For data parallelism (DP), there would be multiple instances of KVCacheBuffer per replica.
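The layout described above can be pictured with a small stand-in. This is only a sketch: `ToyKVCacheBuffer` is a hypothetical class that mirrors the documented constructor arguments, and plain strings stand in for the real device buffers. It is not the MAX implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVCacheBuffer:
    """Hypothetical stand-in for KVCacheBuffer; strings replace real buffers."""

    total_num_pages: int
    values: list[str]                   # one entry per TP shard
    scales: Optional[list[str]] = None  # optional, e.g. for FP8 quantization


tp_degree = 2  # assumed tensor parallel degree for this illustration

# Without quantization: only value buffers, one per TP shard.
plain = ToyKVCacheBuffer(
    total_num_pages=128,
    values=[f"values_shard_{s}" for s in range(tp_degree)],
)

# With FP8 quantization: a matching scale buffer per TP shard.
quantized = ToyKVCacheBuffer(
    total_num_pages=128,
    values=[f"values_shard_{s}" for s in range(tp_degree)],
    scales=[f"scales_shard_{s}" for s in range(tp_degree)],
)

print(len(plain.values), plain.scales)
print(len(quantized.values), len(quantized.scales))
```

In both cases the list length equals the TP degree, which is the invariant the description above states.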
all_buffers
Returns all value and scale buffers in a single flat list.
Returns:
A list containing every value buffer followed by every scale buffer (if scales are present).
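The ordering of the returned list can be illustrated as follows. This is a minimal sketch of the documented behavior, not the library's code: `flatten_buffers` is a hypothetical helper, and strings stand in for the actual buffers.

```python
from typing import Optional, Sequence


def flatten_buffers(
    values: Sequence[str], scales: Optional[Sequence[str]] = None
) -> list[str]:
    """Return every value buffer followed by every scale buffer, if any."""
    flat = list(values)
    if scales is not None:
        # Scale buffers are appended after all value buffers.
        flat.extend(scales)
    return flat


with_scales = flatten_buffers(["v0", "v1"], ["s0", "s1"])
without_scales = flatten_buffers(["v0", "v1"])
print(with_scales)     # ['v0', 'v1', 's0', 's1']
print(without_scales)  # ['v0', 'v1']
```

A flat list of this kind is convenient when every buffer needs the same treatment regardless of whether it holds values or scales.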
allocate_host_offload_buffer()
allocate_host_offload_buffer(total_num_host_pages)
Allocates a KVCacheBuffer for host offloading.
The allocated buffer will have the same characteristics as the original buffer, apart from the total_num_pages and the location. The host offload buffer will be allocated on DevicePinnedBuffer for fast transfer speeds.
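The relationship between the original buffer and its host offload counterpart can be sketched as below. Everything here is an assumption made for illustration: `ToyBufferSpec`, its `page_shape`, `dtype`, and `location` fields, and the `"pinned_host"` label are hypothetical and do not reflect the real MAX types.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyBufferSpec:
    """Hypothetical description of a buffer; fields are illustrative only."""

    total_num_pages: int
    page_shape: tuple[int, ...]
    dtype: str
    location: str

    def allocate_host_offload_spec(self, total_num_host_pages: int) -> "ToyBufferSpec":
        # Copy every characteristic except the page count and the location.
        return replace(
            self,
            total_num_pages=total_num_host_pages,
            location="pinned_host",
        )


device_spec = ToyBufferSpec(
    total_num_pages=128,
    page_shape=(2, 16, 8, 64),
    dtype="bfloat16",
    location="gpu:0",
)
host_spec = device_spec.allocate_host_offload_spec(total_num_host_pages=512)
print(host_spec)
```

Only the page count and the location differ between the two specs, matching the description above. The host side can be given a different page count from the device side, since the count is passed in separately.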
Parameters:
total_num_host_pages (int)
Return type:
scales
total_num_pages
total_num_pages: int
values