Python class

MemoryPlanner

class max.pipelines.kv_cache.MemoryPlanner(config)

Bases: object

Base class for pipeline model memory planning.

Provides default implementations for all estimation methods. Subclasses override the methods that require architecture-specific logic:

  • Estimating KV cache memory requirements.
  • Estimating activation, weight, signal-buffer, and vision-cache memory overheads specific to the model.

A MemoryPlanner is constructed from a ModelConfig alone (not from a full PipelineConfig) so that it can be used independently of the pipeline stack.

Initializes the memory planner with the model config.

Parameters:

config (Any) – Model configuration.

estimate_activation_memory()

estimate_activation_memory(pipeline_config, huggingface_config)

Estimates activation memory beyond model weights.

The default implementation returns 0. Override in subclasses that require temporary buffers for large intermediate tensors (e.g. MLA up-projection during prefill, expert-parallel routing buffers).

Parameters:

  • pipeline_config (Any) – Pipeline configuration.
  • huggingface_config (Any) – HuggingFace model configuration.

Returns:

Estimated activation memory in bytes.

Return type:

int
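
As a rough sketch of such an override, the snippet below sizes a single temporary buffer for a prefill batch. It uses a local stand-in for the base class so that it runs without MAX installed. The stand-in class, the `max_batch_input_tokens` field, the bfloat16 element size, and the sizing formula are all assumptions made for illustration, not MAX's actual implementation.

```python
from types import SimpleNamespace
from typing import Any


class StubMemoryPlanner:
    """Local stand-in mirroring the documented default (returns 0)."""

    def __init__(self, config: Any) -> None:
        self.config = config  # assumption: the planner keeps its model config

    def estimate_activation_memory(
        self, pipeline_config: Any, huggingface_config: Any
    ) -> int:
        return 0


class UpProjectionPlanner(StubMemoryPlanner):
    """Hypothetical planner reserving one intermediate buffer for prefill."""

    BYTES_PER_ELEMENT = 2  # assumption: bfloat16 activations

    def estimate_activation_memory(
        self, pipeline_config: Any, huggingface_config: Any
    ) -> int:
        tokens = pipeline_config.max_batch_input_tokens  # hypothetical field
        width = huggingface_config.hidden_size
        # Illustrative formula: one (tokens x width) tensor of 2-byte elements.
        return tokens * width * self.BYTES_PER_ELEMENT


pipeline_config = SimpleNamespace(max_batch_input_tokens=8192)
huggingface_config = SimpleNamespace(hidden_size=4096)

default_bytes = StubMemoryPlanner(None).estimate_activation_memory(
    pipeline_config, huggingface_config
)
override_bytes = UpProjectionPlanner(None).estimate_activation_memory(
    pipeline_config, huggingface_config
)
print(default_bytes, override_bytes)
```

With MAX installed, the same override would presumably subclass max.pipelines.kv_cache.MemoryPlanner instead of the stand-in.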

estimate_signal_buffer_memory()

estimate_signal_buffer_memory(pipeline_config, arch_config=None)

Estimates signal-buffer memory in bytes across all devices.

Signal buffers are fixed-size per-GPU allocations used by P2P collectives. The default implementation returns 0 for single-device pipelines and delegates to pipeline_config.estimate_signal_buffer_memory for multi-device pipelines.

Models that perform allreduce unconditionally (e.g. via VocabParallelEmbedding) need signal buffers even on a single device. Set always_signal_buffers=True on the planner class to enable this.

Parameters:

  • pipeline_config (Any) – Pipeline configuration.
  • arch_config (Any | None) – Optional architecture config; when provided, tightens the BlockOffloadEngine term using the actual replicates_kv_across_tp flag.

Returns:

Estimated signal-buffer memory in bytes across all devices.

Return type:

int
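
The decision described above can be sketched with local stand-ins, so the snippet runs without MAX installed. The `num_devices` attribute, the zero-argument signature of the pipeline config's `estimate_signal_buffer_memory`, and the placeholder buffer size are assumptions; only the `always_signal_buffers` class attribute and the single-device/multi-device behavior come from the description. The sketch ignores `arch_config`.

```python
from typing import Any, Optional


class FakePipelineConfig:
    """Hypothetical stand-in for a pipeline config."""

    def __init__(self, num_devices: int, bytes_per_device: int) -> None:
        self.num_devices = num_devices  # hypothetical attribute
        self.bytes_per_device = bytes_per_device  # placeholder size

    def estimate_signal_buffer_memory(self) -> int:
        # Assumed zero-argument signature; returns the total across devices.
        return self.num_devices * self.bytes_per_device


class StubMemoryPlanner:
    """Local stand-in following the documented default behavior."""

    always_signal_buffers = False

    def estimate_signal_buffer_memory(
        self, pipeline_config: Any, arch_config: Optional[Any] = None
    ) -> int:
        single_device = pipeline_config.num_devices <= 1
        if single_device and not self.always_signal_buffers:
            return 0
        return pipeline_config.estimate_signal_buffer_memory()


class VocabParallelPlanner(StubMemoryPlanner):
    """Hypothetical planner for a model that always performs allreduce."""

    always_signal_buffers = True


one_gpu = FakePipelineConfig(num_devices=1, bytes_per_device=1_000)
four_gpus = FakePipelineConfig(num_devices=4, bytes_per_device=1_000)

default_single = StubMemoryPlanner().estimate_signal_buffer_memory(one_gpu)
always_single = VocabParallelPlanner().estimate_signal_buffer_memory(one_gpu)
default_multi = StubMemoryPlanner().estimate_signal_buffer_memory(four_gpus)
print(default_single, always_single, default_multi)
```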

estimate_vision_cache_entry_bytes()

estimate_vision_cache_entry_bytes(huggingface_config)

Estimates bytes for one vision encoder cache entry.

The default implementation returns 0. Override in VLM planners to return the worst-case memory for a single max-resolution image after the vision encoder’s spatial merge / patch merge step.

Parameters:

huggingface_config (Any) – HuggingFace model configuration.

Returns:

Estimated bytes per vision cache entry, or 0 for text-only models.

Return type:

int
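
A VLM override might look roughly like the sketch below, which counts the tokens left after a spatial merge for one maximum-resolution square image. The stand-in base class, the config field names (`max_image_size`, `patch_size`, `spatial_merge_size`, `out_hidden_size`), and the 2-byte element size are illustrative assumptions; real vision configs differ between model families.

```python
from types import SimpleNamespace
from typing import Any


class StubMemoryPlanner:
    """Local stand-in mirroring the documented default (returns 0)."""

    def estimate_vision_cache_entry_bytes(self, huggingface_config: Any) -> int:
        return 0


class MergedPatchPlanner(StubMemoryPlanner):
    """Hypothetical VLM planner; all field names are assumptions."""

    BYTES_PER_ELEMENT = 2  # assumption: bfloat16 embeddings

    def estimate_vision_cache_entry_bytes(self, huggingface_config: Any) -> int:
        vision = huggingface_config.vision_config
        # Patches along one side of the largest supported square image.
        patches_per_side = vision.max_image_size // vision.patch_size
        # The spatial merge shrinks each side by the merge factor.
        merged_per_side = patches_per_side // vision.spatial_merge_size
        merged_tokens = merged_per_side**2
        return merged_tokens * vision.out_hidden_size * self.BYTES_PER_ELEMENT


huggingface_config = SimpleNamespace(
    vision_config=SimpleNamespace(
        max_image_size=1344,
        patch_size=14,
        spatial_merge_size=2,
        out_hidden_size=3584,
    )
)

text_only_bytes = StubMemoryPlanner().estimate_vision_cache_entry_bytes(
    huggingface_config
)
vlm_bytes = MergedPatchPlanner().estimate_vision_cache_entry_bytes(
    huggingface_config
)
print(text_only_bytes, vlm_bytes)
```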

estimate_weights_size()

estimate_weights_size(pipeline_config)

Estimates the memory consumed by model weights in bytes.

The default implementation delegates to pipeline_config.model.weights_size(). Override in subclasses that need architecture-specific weight accounting (e.g. expert-parallel sharding adjustments).

Parameters:

pipeline_config (Any) – Pipeline configuration providing the model config.

Returns:

Estimated weight memory in bytes.

Return type:

int
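
The sketch below illustrates the delegation and one possible override. It assumes, purely for illustration, that `weights_size()` counts every expert once while each device holds only a `1/ep_size` share of the expert weights. The stand-in classes and the `expert_weight_bytes` and `ep_size` fields are hypothetical; only the call to `pipeline_config.model.weights_size()` is taken from the description above.

```python
from types import SimpleNamespace
from typing import Any

GIB = 1024**3


class StubMemoryPlanner:
    """Local stand-in mirroring the documented default delegation."""

    def estimate_weights_size(self, pipeline_config: Any) -> int:
        return pipeline_config.model.weights_size()


class ExpertParallelPlanner(StubMemoryPlanner):
    """Hypothetical planner that discounts expert weights sharded by EP."""

    def estimate_weights_size(self, pipeline_config: Any) -> int:
        total = super().estimate_weights_size(pipeline_config)
        expert_bytes = pipeline_config.expert_weight_bytes  # hypothetical field
        ep_size = pipeline_config.ep_size  # hypothetical field
        shared_bytes = total - expert_bytes
        # Ceiling division so an uneven split is never underestimated.
        per_device_expert_bytes = -(-expert_bytes // ep_size)
        return shared_bytes + per_device_expert_bytes


pipeline_config = SimpleNamespace(
    model=SimpleNamespace(weights_size=lambda: 16 * GIB),
    expert_weight_bytes=12 * GIB,
    ep_size=4,
)

default_size = StubMemoryPlanner().estimate_weights_size(pipeline_config)
adjusted_size = ExpertParallelPlanner().estimate_weights_size(pipeline_config)
print(default_size // GIB, adjusted_size // GIB)
```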