Python class

MemoryPlanner

class max.pipelines.kv_cache.MemoryPlanner(config)

Bases: object

Base class for pipeline model memory planning.

Provides default implementations for all estimation methods. Subclasses override the methods that require architecture-specific logic:

Estimating KV cache memory requirements.
Estimating activation, weight, signal-buffer, and vision-cache memory overheads specific to the model.

A MemoryPlanner is constructed from a ModelConfig alone (not from a full PipelineConfig) so that it can be used independently of the pipeline stack.

Initializes the memory planner with the model config.

Parameters:

config (Any) – Model configuration.

`estimate_activation_memory()`

estimate_activation_memory(pipeline_config, huggingface_config)

Estimates activation memory beyond model weights.

The default implementation returns 0. Override in subclasses that require temporary buffers for large intermediate tensors (e.g. MLA up-projection during prefill, expert-parallel routing buffers).

Parameters:

pipeline_config (Any) – Pipeline configuration.
huggingface_config (Any) – HuggingFace model configuration.

Returns:

Estimated activation memory in bytes.

Return type:

int
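
As a minimal sketch of the override pattern described above, the block below uses a local stand-in class rather than the real MemoryPlanner, so it runs without MAX installed. The prefill_tokens field, the 16-bit activation size, and the single-buffer formula are assumptions made for illustration; only the method name, its parameters, and the default of 0 come from this page.

```python
from types import SimpleNamespace


class PlannerStandIn:
    """Local stand-in for the documented default; not the real MemoryPlanner."""

    def __init__(self, config):
        self.config = config

    def estimate_activation_memory(self, pipeline_config, huggingface_config) -> int:
        # Documented default: no activation memory beyond the weights.
        return 0


class PrefillBufferPlanner(PlannerStandIn):
    """Hypothetical subclass that reserves one temporary prefill buffer."""

    BYTES_PER_ELEMENT = 2  # assumption: 16-bit activations

    def estimate_activation_memory(self, pipeline_config, huggingface_config) -> int:
        tokens = pipeline_config.prefill_tokens  # hypothetical field name
        width = huggingface_config.hidden_size
        # One buffer of shape (tokens, width), sized in bytes.
        return tokens * width * self.BYTES_PER_ELEMENT


pipeline_config = SimpleNamespace(prefill_tokens=8192)
huggingface_config = SimpleNamespace(hidden_size=4096)

default_bytes = PlannerStandIn(config=None).estimate_activation_memory(
    pipeline_config, huggingface_config
)
override_bytes = PrefillBufferPlanner(config=None).estimate_activation_memory(
    pipeline_config, huggingface_config
)
print(default_bytes, override_bytes)
```

With these illustrative numbers the override reports 64 MiB (8192 x 4096 x 2 bytes), while the stand-in default stays at 0.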

`estimate_signal_buffer_memory()`

estimate_signal_buffer_memory(pipeline_config, arch_config=None)

Estimates signal-buffer memory in bytes across all devices.

Signal buffers are fixed-size per-GPU allocations used by P2P collectives. The default returns 0 for single-device pipelines and delegates to pipeline_config.estimate_signal_buffer_memory for multi-device pipelines.

Models that perform allreduce unconditionally (e.g. via VocabParallelEmbedding) need signal buffers even on a single device. Set always_signal_buffers=True on the planner class to enable this.

Parameters:

pipeline_config (Any) – Pipeline configuration.
arch_config (Any | None) – Optional architecture config; when provided, tightens the BlockOffloadEngine term using the actual replicates_kv_across_tp flag.

Returns:

Estimated signal-buffer memory in bytes across all devices.

Return type:

int
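
The decision described above can be sketched as follows. Everything here is a stand-in: FakePipelineConfig, its num_devices and bytes_per_device fields, and the way arch_config is forwarded are assumptions for illustration, and the sketch also assumes that always_signal_buffers=True simply routes single-device pipelines through the same delegation. Only the method name, the always_signal_buffers attribute, and the zero-versus-delegate behavior come from this page.

```python
class FakePipelineConfig:
    """Hypothetical stand-in; the real pipeline config is not reproduced here."""

    def __init__(self, num_devices: int, bytes_per_device: int):
        self.num_devices = num_devices
        self.bytes_per_device = bytes_per_device

    def estimate_signal_buffer_memory(self, arch_config=None) -> int:
        # Assumed shape of the delegated estimate: fixed size per device.
        return self.num_devices * self.bytes_per_device


class SignalPlannerStandIn:
    """Local stand-in for the documented default; not the real MemoryPlanner."""

    always_signal_buffers = False

    def __init__(self, config):
        self.config = config

    def estimate_signal_buffer_memory(self, pipeline_config, arch_config=None) -> int:
        single_device = pipeline_config.num_devices <= 1
        if single_device and not self.always_signal_buffers:
            return 0
        return pipeline_config.estimate_signal_buffer_memory(arch_config)


class AlwaysAllreducePlanner(SignalPlannerStandIn):
    """Hypothetical planner for a model that runs allreduce on one device too."""

    always_signal_buffers = True


MIB = 1024 * 1024
single = FakePipelineConfig(num_devices=1, bytes_per_device=512 * MIB)
multi = FakePipelineConfig(num_devices=4, bytes_per_device=512 * MIB)

single_default = SignalPlannerStandIn(None).estimate_signal_buffer_memory(single)
multi_default = SignalPlannerStandIn(None).estimate_signal_buffer_memory(multi)
single_forced = AlwaysAllreducePlanner(None).estimate_signal_buffer_memory(single)
print(single_default, multi_default, single_forced)
```

Making always_signal_buffers a class attribute keeps the choice with the architecture rather than with each call site, which matches the instruction to set it on the planner class.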

`estimate_vision_cache_entry_bytes()`

estimate_vision_cache_entry_bytes(huggingface_config)

Estimates bytes for one vision encoder cache entry.

The default implementation returns 0. Override in VLM planners to return the worst-case memory for a single max-resolution image after the vision encoder’s spatial merge / patch merge step.

Parameters:

huggingface_config (Any) – HuggingFace model configuration.

Returns:

Estimated bytes per vision cache entry, or 0 for text-only models.

Return type:

int
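
One way a VLM planner might size a single worst-case entry is sketched below, again with a local stand-in instead of the real base class. The config layout (vision.max_image_size, vision.patch_size, vision.spatial_merge_size, vision.out_hidden_size), the square-image assumption, and the 16-bit element size are all hypothetical; real architectures may count tokens differently.

```python
from types import SimpleNamespace


class VisionPlannerStandIn:
    """Local stand-in for the documented default; not the real MemoryPlanner."""

    def __init__(self, config):
        self.config = config

    def estimate_vision_cache_entry_bytes(self, huggingface_config) -> int:
        # Documented default: text-only models cache no vision entries.
        return 0


class MergedPatchPlanner(VisionPlannerStandIn):
    """Hypothetical VLM planner: one square max-resolution image per entry."""

    BYTES_PER_ELEMENT = 2  # assumption: 16-bit encoder outputs

    def estimate_vision_cache_entry_bytes(self, huggingface_config) -> int:
        vision = huggingface_config.vision  # hypothetical config layout
        patches_per_side = vision.max_image_size // vision.patch_size
        # The spatial merge reduces each side by the merge factor.
        merged_per_side = patches_per_side // vision.spatial_merge_size
        tokens = merged_per_side * merged_per_side
        return tokens * vision.out_hidden_size * self.BYTES_PER_ELEMENT


huggingface_config = SimpleNamespace(
    vision=SimpleNamespace(
        max_image_size=1344,
        patch_size=14,
        spatial_merge_size=2,
        out_hidden_size=4096,
    )
)

text_only_bytes = VisionPlannerStandIn(None).estimate_vision_cache_entry_bytes(
    huggingface_config
)
entry_bytes = MergedPatchPlanner(None).estimate_vision_cache_entry_bytes(
    huggingface_config
)
print(text_only_bytes, entry_bytes)
```

For these illustrative values, 96 patches per side merge down to 48, giving 2304 tokens and 18 MiB per cached image.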

`estimate_weights_size()`

estimate_weights_size(pipeline_config)

Estimates the memory consumed by model weights in bytes.

The default implementation delegates to pipeline_config.model.weights_size(). Override in subclasses that need architecture-specific weight accounting (e.g. expert-parallel sharding adjustments).

Parameters:

pipeline_config (Any) – Pipeline configuration providing the model config.

Returns:

Estimated weight memory in bytes.

Return type:

int
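
The delegation and an override on top of it might look like the sketch below. The stand-in base class mirrors only the documented call to pipeline_config.model.weights_size(); the sharding_adjustment_bytes field and the additive adjustment are hypothetical placeholders for whatever architecture-specific accounting a real planner performs.

```python
from types import SimpleNamespace


class WeightsPlannerStandIn:
    """Local stand-in for the documented default; not the real MemoryPlanner."""

    def __init__(self, config):
        self.config = config

    def estimate_weights_size(self, pipeline_config) -> int:
        # Documented default: defer to the model config.
        return pipeline_config.model.weights_size()


class ShardAdjustedPlanner(WeightsPlannerStandIn):
    """Hypothetical subclass that corrects the default for sharding effects."""

    def estimate_weights_size(self, pipeline_config) -> int:
        base = super().estimate_weights_size(pipeline_config)
        # Hypothetical field: extra bytes not captured by weights_size().
        return base + pipeline_config.sharding_adjustment_bytes


pipeline_config = SimpleNamespace(
    model=SimpleNamespace(weights_size=lambda: 16_000_000_000),
    sharding_adjustment_bytes=1_500_000_000,
)

default_weights = WeightsPlannerStandIn(None).estimate_weights_size(pipeline_config)
adjusted_weights = ShardAdjustedPlanner(None).estimate_weights_size(pipeline_config)
print(default_weights, adjusted_weights)
```

Calling super() first keeps the subclass in step with the default, so only the architecture-specific delta lives in the override.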
