Python class

PagedMemoryPlanner

`PagedMemoryPlanner`

class max.pipelines.kv_cache.PagedMemoryPlanner(config)

Bases: MemoryPlanner

Memory planner for models that use a paged KV cache.

This is the standard planner for autoregressive text-generation models. It delegates KV-parameter queries to the model config via the ModelConfigWithKVCache protocol.

For models that require a fixed activation-memory reservation (e.g. VLMs that need headroom for vision processing), use with_activation_reservation() to create a pre-configured subclass instead of writing a custom MemoryPlanner:

memory_planner=PagedMemoryPlanner.with_activation_reservation(
    15 * 1024**3  # 15 GiB
)

Parameters:

config (Any) – Model configuration that implements ModelConfigWithKVCache (i.e. exposes both devices and get_kv_params).

Raises:

TypeError – If config does not implement ModelConfigWithKVCache.

Initializes the paged memory planner.

Parameters:

config (Any) – Must implement ModelConfigWithKVCache.

Raises:

TypeError – If config does not satisfy ModelConfigWithKVCache.
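
The constructor check described above can be pictured with a runtime-checkable protocol. This is a minimal sketch, not the library's implementation: KVConfigLike, SketchPlanner, GoodConfig, and BadConfig are hypothetical stand-ins, and the zero-argument get_kv_params signature is an assumption made only for illustration.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class KVConfigLike(Protocol):
    """Hypothetical stand-in for ModelConfigWithKVCache."""

    devices: Any

    def get_kv_params(self) -> Any: ...


class SketchPlanner:
    """Hypothetical stand-in for the PagedMemoryPlanner constructor check."""

    def __init__(self, config: Any) -> None:
        # isinstance() on a runtime-checkable Protocol only verifies that the
        # members exist; it does not validate their types or signatures.
        if not isinstance(config, KVConfigLike):
            raise TypeError(
                f"{type(config).__name__} must expose 'devices' and 'get_kv_params'"
            )
        self.config = config


class GoodConfig:
    devices = ["gpu:0"]

    def get_kv_params(self) -> dict[str, int]:
        return {"n_kv_heads": 8}  # placeholder values for the sketch


class BadConfig:
    devices = ["gpu:0"]  # get_kv_params is missing on purpose
```

In this sketch, SketchPlanner(GoodConfig()) succeeds, while SketchPlanner(BadConfig()) raises TypeError because get_kv_params is absent.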

`estimate_activation_memory()`

estimate_activation_memory(pipeline_config, huggingface_config)

Returns the fixed activation-memory reservation for this planner.

The default is 0. Subclasses created via with_activation_reservation() return the configured value.

Parameters:

pipeline_config (Any) – Unused by the default implementation.
huggingface_config (Any) – Unused by the default implementation.

Returns:

Activation memory reservation in bytes.

Return type:

int
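
To see why the reservation matters, the arithmetic below shows how a fixed activation reservation could shrink the memory left for the KV cache. It is a rough sketch under stated assumptions: kv_cache_budget is a hypothetical helper, the device and weight sizes are illustrative, and the real planner's accounting may include terms not shown here.

```python
GiB = 1024**3


def kv_cache_budget(free_bytes: int, weight_bytes: int, activation_bytes: int) -> int:
    """Bytes left for the KV cache after weights and the activation reservation.

    Hypothetical helper: the real planner may account for additional terms.
    """
    remaining = free_bytes - weight_bytes - activation_bytes
    return max(remaining, 0)  # never report a negative budget


# Illustrative numbers only: 80 GiB free, 20 GiB of weights.
default_budget = kv_cache_budget(80 * GiB, 20 * GiB, 0)
reserved_budget = kv_cache_budget(80 * GiB, 20 * GiB, 15 * GiB)
print(default_budget // GiB, reserved_budget // GiB)  # 60 45
```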

`with_activation_reservation()`

classmethod with_activation_reservation(activation_bytes, always_signal_buffers=False)

Returns a PagedMemoryPlanner subclass with a fixed activation-memory reservation.

Use this instead of writing a custom MemoryPlanner subclass for architectures that simply need to reserve a fixed chunk of GPU memory before KV cache allocation (e.g. for vision processing headroom):

memory_planner=PagedMemoryPlanner.with_activation_reservation(
    15 * 1024**3  # 15 GiB
)

For models that perform allreduce unconditionally (e.g. VLMs using VocabParallelEmbedding), pass always_signal_buffers=True so that signal-buffer memory is reserved even on a single GPU:

memory_planner=PagedMemoryPlanner.with_activation_reservation(
    15 * 1024**3, always_signal_buffers=True
)

Parameters:

activation_bytes (int) – Activation memory to reserve in bytes.
always_signal_buffers (bool) – When True, reserve signal-buffer memory even on single-device pipelines.

Returns:

A new PagedMemoryPlanner subclass whose estimate_activation_memory() returns activation_bytes.

Return type:

type[PagedMemoryPlanner]
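
The class-factory pattern behind this method can be sketched with a stand-in class. This is one plausible shape for it, not the library's source: SketchPlanner is hypothetical, and storing the flag in a class attribute named always_signal_buffers is an assumption, since this reference does not say how the real class keeps it.

```python
from typing import Any


class SketchPlanner:
    """Hypothetical stand-in for PagedMemoryPlanner."""

    # Assumed attribute name; the real storage of this flag is not documented here.
    always_signal_buffers: bool = False

    def estimate_activation_memory(
        self, pipeline_config: Any, huggingface_config: Any
    ) -> int:
        return 0  # default: no fixed reservation

    @classmethod
    def with_activation_reservation(
        cls, activation_bytes: int, always_signal_buffers: bool = False
    ) -> "type[SketchPlanner]":
        # Build a subclass that closes over the requested reservation.
        class _Reserved(cls):
            def estimate_activation_memory(
                self, pipeline_config: Any, huggingface_config: Any
            ) -> int:
                return activation_bytes

        _Reserved.always_signal_buffers = always_signal_buffers
        return _Reserved


# The factory returns a class, not an instance.
Reserved = SketchPlanner.with_activation_reservation(
    15 * 1024**3, always_signal_buffers=True
)
planner = Reserved()
```

Returning a class rather than an instance lets the caller pass the result wherever a planner type is expected, and the base class keeps its default of 0.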
