Python class

KVCacheBuffer

class max.nn.kv_cache.KVCacheBuffer(total_num_pages, values, scales=None)

Bases: object

A collection of the buffers that back the KV cache.

Two types of buffers are currently supported: values and scales. The scales are optional and are used for FP8 quantization.

The length of each buffer list corresponds to the tensor parallel (TP) degree: each buffer in the list corresponds to a single TP shard.

With data parallelism (DP), there are multiple KVCacheBuffer instances, one per replica.
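The layout described above can be sketched with stand-in types. This is a minimal illustration, not the library's implementation: `FakeBuffer`, `KVCacheBufferSketch`, `make_replica`, and the device numbering are all hypothetical names and assumptions introduced here; only the three field names come from the class signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeBuffer:
    """Hypothetical stand-in for a device buffer (not the real Buffer type)."""
    device: str
    kind: str  # "values" or "scales"


@dataclass
class KVCacheBufferSketch:
    """Illustrative mirror of the three documented fields."""
    total_num_pages: int
    values: list[FakeBuffer]
    scales: Optional[list[FakeBuffer]] = None


def make_replica(replica: int, tp_degree: int, num_pages: int, fp8: bool) -> KVCacheBufferSketch:
    """Build one replica's buffers: one list entry per TP shard."""
    # Assumed device numbering, chosen only for this illustration.
    devices = [f"gpu:{replica * tp_degree + shard}" for shard in range(tp_degree)]
    values = [FakeBuffer(d, "values") for d in devices]
    # In this sketch, scales exist only when FP8 quantization is enabled.
    scales = [FakeBuffer(d, "scales") for d in devices] if fp8 else None
    return KVCacheBufferSketch(num_pages, values, scales)


# DP degree 2, TP degree 4: two instances, each holding four shards.
replicas = [make_replica(r, tp_degree=4, num_pages=128, fp8=True) for r in range(2)]
```

Here `len(replicas)` plays the role of the DP degree and `len(replicas[0].values)` the TP degree.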

Parameters:

total_num_pages (int)

values (list[Buffer])

scales (list[Buffer] | None)

all_buffers

property all_buffers: list[Buffer]

Returns all value and scale buffers in a single flat list.

Returns:

A list containing every value buffer followed by every scale buffer (if scales are present).
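The ordering described for this property (values first, then scales when present) can be sketched as a small standalone function. `flatten_buffers` is a hypothetical helper written for illustration, and plain strings stand in for real buffers; it is not the property's actual source.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def flatten_buffers(values: Sequence[T], scales: Optional[Sequence[T]] = None) -> list[T]:
    """Return every value buffer, then every scale buffer if any exist."""
    flat = list(values)
    if scales is not None:
        flat.extend(scales)
    return flat


# Strings stand in for per-shard buffers (TP degree 2).
with_scales = flatten_buffers(["v0", "v1"], ["s0", "s1"])
without_scales = flatten_buffers(["v0", "v1"])
```

A flat list like this is convenient when the same operation must be applied to every buffer regardless of its role.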

allocate_host_offload_buffer()

allocate_host_offload_buffer(total_num_host_pages)

Allocates a KVCacheBuffer for host offloading.

The allocated buffer has the same characteristics as the original buffer, apart from total_num_pages and the location. The host offload buffer is allocated on DevicePinnedBuffer for fast transfer speeds.

Parameters:

total_num_host_pages (int)

Return type:

KVCacheBuffer
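The "same characteristics, different page count and location" relationship can be modeled with a plain metadata record. This is a conceptual sketch only: `BufferSpec`, its `page_shape`, `dtype`, and `location` fields, and `host_offload_spec` are hypothetical and do not reflect how the library stores or allocates buffers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BufferSpec:
    """Hypothetical description of one KV cache buffer (illustration only)."""
    total_num_pages: int
    page_shape: tuple[int, ...]  # assumed per-page layout
    dtype: str
    location: str  # e.g. "device" or "pinned_host"


def host_offload_spec(device_spec: BufferSpec, total_num_host_pages: int) -> BufferSpec:
    """Copy the spec, changing only the page count and the location."""
    return replace(
        device_spec,
        total_num_pages=total_num_host_pages,
        location="pinned_host",
    )


device_spec = BufferSpec(
    total_num_pages=128,
    page_shape=(2, 16, 8, 64),
    dtype="bfloat16",
    location="device",
)
host_spec = host_offload_spec(device_spec, total_num_host_pages=512)
```

In this model the host buffer can hold more pages than the device buffer, while the per-page layout and dtype stay identical so pages can be copied between the two without conversion.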

scales

scales: list[Buffer] | None = None

total_num_pages

total_num_pages: int

values

values: list[Buffer]
