Python function

compute_num_device_blocks

`compute_num_device_blocks()`

`max.nn.kv_cache.compute_num_device_blocks(params, available_cache_memory, max_batch_size, max_seq_len, require_max_seq_len_fits=False)`

Computes the number of blocks that can be allocated based on the available cache memory.

The number of blocks returned is for a single replica. Each replica will have the same number of blocks.

Parameters:

params (KVCacheParamInterface)
available_cache_memory (int) – The amount of cache memory available across all devices.
max_batch_size (int | None) – The maximum batch size, or None.
max_seq_len (int | None) – The maximum sequence length, or None.
require_max_seq_len_fits (bool) – When True, raise an error instead of issuing a warning if a single request at max_seq_len cannot fit in the allocable device blocks. Memory estimation deliberately probes oversized configurations, so only the actual cache-allocation path should set this flag.

Returns:

The number of blocks that can be allocated for a single replica.

Return type:

int
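
The arithmetic behind a per-replica block count can be illustrated with a minimal, self-contained sketch. This is not the MAX implementation: `sketch_num_device_blocks`, `bytes_per_block`, `page_size`, and `num_replicas` are hypothetical names introduced here, and the sketch assumes that memory is split evenly across replicas, that one request needs `ceil(max_seq_len / page_size)` blocks, and that the result is capped at what a full batch could use. It also assumes `RuntimeError` as the raised exception type, which the documentation above does not specify.

```python
import warnings
from typing import Optional


def sketch_num_device_blocks(
    bytes_per_block: int,  # hypothetical: size of one cache block in bytes
    page_size: int,  # hypothetical: number of tokens stored per block
    num_replicas: int,  # hypothetical: number of replicas sharing the memory
    available_cache_memory: int,
    max_batch_size: Optional[int],
    max_seq_len: Optional[int],
    require_max_seq_len_fits: bool = False,
) -> int:
    """Illustrative block-count arithmetic; not the library's actual logic."""
    # Assumption: cache memory is divided evenly across replicas.
    memory_per_replica = available_cache_memory // num_replicas
    num_blocks = memory_per_replica // bytes_per_block

    if max_seq_len is not None:
        # Ceiling division: blocks needed by one request at max_seq_len.
        blocks_per_request = (max_seq_len + page_size - 1) // page_size

        # Assumption: never allocate more than a full batch could use.
        if max_batch_size is not None:
            num_blocks = min(num_blocks, blocks_per_request * max_batch_size)

        if num_blocks < blocks_per_request:
            message = (
                f"A request of {max_seq_len} tokens needs "
                f"{blocks_per_request} blocks, but only {num_blocks} fit."
            )
            if require_max_seq_len_fits:
                # Strict mode: the real allocation path should fail loudly.
                raise RuntimeError(message)
            # Lenient mode: estimation may probe oversized configurations.
            warnings.warn(message)

    return num_blocks


# 1,000,000 bytes over 2 replicas, 1,024-byte blocks, 128 tokens per block.
print(sketch_num_device_blocks(1024, 128, 2, 1_000_000, 4, 2048))  # prints 64
```

In this example each replica could hold 488 blocks by memory alone, but a batch of 4 requests at 2,048 tokens needs at most 64, so the cap applies. With `require_max_seq_len_fits=False` an undersized configuration still returns a count and only warns, which matches the described use during memory estimation.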
