Python module

max.pipelines.architectures.minimax_m2

MiniMaxM2Config​

class max.pipelines.architectures.minimax_m2.MiniMaxM2Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, target_layer_ids=None, use_subgraphs=True, data_parallel_degree=1, sliding_window=None, num_local_experts=256, num_experts_per_tok=8, norm_topk_prob=True, correction_bias_dtype=None, gate_dtype=None, attn_dtype=None, ep_config=None, partial_rotary_factor=1.0)

Bases: Llama3Config

Configuration for MiniMax-M2 MoE models.

Extends Llama3Config with MoE-specific parameters including sigmoid routing with expert score correction bias.

Parameters:

attn_dtype​

attn_dtype: DType | None = None

Data type for attention weights. Detected from state dict during finalize().

calculate_attention_multiplier()​

static calculate_attention_multiplier(huggingface_config)

Computes the attention multiplier for MiniMax-M2 models.

Uses the explicit head_dim from the config.

Parameters:

huggingface_config (AutoConfig) – The HuggingFace configuration object.

Returns:

The attention multiplier value.

Return type:

float
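The page does not state the formula, so the following is only a minimal sketch, assuming the usual scaled dot-product convention of 1/sqrt(head_dim). The `attention_multiplier` helper and the config values are illustrative stand-ins, not the library's implementation. It shows why reading `head_dim` explicitly matters when it differs from `hidden_size / num_attention_heads`.

```python
import math
from types import SimpleNamespace


def attention_multiplier(config) -> float:
    """Return 1/sqrt(head_dim), reading head_dim directly from the config.

    Assumption: the multiplier follows the standard attention scaling rule.
    """
    return 1.0 / math.sqrt(float(config.head_dim))


# Illustrative values only; a stand-in for a Hugging Face config object.
cfg = SimpleNamespace(hidden_size=3072, num_attention_heads=48, head_dim=128)

explicit = attention_multiplier(cfg)
# What you would get by deriving head_dim from hidden_size instead.
derived = 1.0 / math.sqrt(cfg.hidden_size / cfg.num_attention_heads)

print(f"explicit head_dim: {explicit:.6f}, derived head_dim: {derived:.6f}")
```

With these illustrative numbers the two results differ, which is the case an explicit `head_dim` is meant to handle.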

correction_bias_dtype​

correction_bias_dtype: DType | None = None

Data type of the e_score_correction_bias weight. Detected from state dict during finalize().

ep_config​

ep_config: EPConfig | None = None

Expert parallelism configuration. None means no EP (single-GPU).

gate_dtype​

gate_dtype: DType | None = None

Data type for the gate linear layer. Detected from state dict during finalize().

initialize()​

classmethod initialize(pipeline_config, model_config=None)

Initializes a MiniMaxM2Config from pipeline configuration.

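Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • model_config (MAXModelConfig | None) – The MAX Engine model configuration.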

Returns:

An initialized MiniMaxM2Config instance.

Return type:

Self

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

Initializes a MiniMaxM2Config from pipeline and HuggingFace configs.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – The HuggingFace model configuration.
  • model_config (MAXModelConfig | None) – The MAX Engine model configuration.

Returns:

An initialized MiniMaxM2Config instance.

Return type:

Self

norm_topk_prob​

norm_topk_prob: bool = True

Whether to normalize top-k expert probabilities to sum to 1.

num_experts_per_tok​

num_experts_per_tok: int = 8

Number of experts selected per token.

num_local_experts​

num_local_experts: int = 256

Number of local experts in each MoE layer.
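The three MoE settings above interact in the router. The following is a rough sketch of sigmoid routing with a score correction bias for a single token, not the MAX implementation. It assumes the bias influences only which experts are selected, while the mixing weights come from the unbiased sigmoid scores; `route_token` and the toy inputs are hypothetical.

```python
import math


def route_token(logits, correction_bias, top_k, norm_topk_prob=True):
    """Pick top_k experts for one token and return (expert_ids, weights)."""
    # Sigmoid scores, one per expert (no softmax across experts).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Assumption: the correction bias only shifts the selection ranking.
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Mixing weights use the unbiased scores of the chosen experts.
    weights = [scores[i] for i in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen, weights


# Toy example: 4 experts and top_k=2 (the documented defaults are 256 and 8).
logits = [2.0, -1.0, 0.5, 0.0]
bias = [0.0, 0.0, 0.0, 0.9]  # boosts expert 3 during selection
experts, weights = route_token(logits, bias, top_k=2)
print(experts, weights)
```

In this toy input the bias promotes expert 3 into the selection even though its raw score is lower than that of expert 2, and `norm_topk_prob=True` rescales the two chosen weights to sum to 1.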

partial_rotary_factor​

partial_rotary_factor: float = 1.0

Fraction of head_dim used for rotary embeddings. The default of 1.0 applies rotary embeddings to the full head; for MiniMax-M2, rotary_dim / head_dim = 64 / 128 = 0.5.
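To make the fraction concrete, here is a small sketch of partial rotary embedding on one head vector. It assumes a rotate-half pairing and a base of 10000.0; the actual pairing in MAX depends on settings such as interleaved_rope_weights, and the base comes from rope_theta. The `apply_partial_rope` helper is hypothetical.

```python
import math


def apply_partial_rope(x, position, partial_rotary_factor, theta=10000.0):
    """Rotate the first rotary_dim entries of one head vector; pass the rest through."""
    head_dim = len(x)
    rotary_dim = int(head_dim * partial_rotary_factor)
    half = rotary_dim // 2
    out = list(x)
    for i in range(half):
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]  # rotate-half pairing (an assumption)
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out


vec = [1.0] * 128  # head_dim = 128
rotated = apply_partial_rope(vec, position=3, partial_rotary_factor=0.5)
print(rotated[64:] == vec[64:])  # the last 64 dimensions are untouched
```

With a factor of 0.5 only the first 64 of 128 dimensions carry positional information; the remaining 64 pass through unchanged.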

MiniMaxM2Inputs​

class max.pipelines.architectures.minimax_m2.MiniMaxM2Inputs(tokens, input_row_offsets, signal_buffers, return_n_logits, lora_grouped_offsets=None, num_active_loras=None, lora_end_idx=None, batch_seq_len=None, lora_ids_kv=None, lora_grouped_offsets_kv=None, data_parallel_splits=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, ep_inputs=(), host_input_row_offsets=None)

Bases: Llama3Inputs

Inputs for MiniMax-M2 with EP and DP support.

Parameters:

buffers​

property buffers: tuple[Buffer, ...]

Returns positional Buffer inputs for model ABI calls.

ep_inputs​

ep_inputs: tuple[Buffer, ...] = ()

host_input_row_offsets​

host_input_row_offsets: Buffer | None = None

MiniMaxM2Model​

class max.pipelines.architectures.minimax_m2.MiniMaxM2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: AlwaysSignalBuffersMixin, LlamaModelBase

MiniMax-M2 pipeline model for text generation.

Uses AlwaysSignalBuffersMixin since VocabParallelEmbedding and ColumnParallelLinear always require signal buffers for allreduce.

Parameters:

attention_bias​

attention_bias: bool = False

Whether to use attention bias.

estimate_activation_memory()​

classmethod estimate_activation_memory(pipeline_config, huggingface_config)

Estimates the activation memory required for model execution.

This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.

The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – Hugging Face model configuration

Returns:

Estimated activation memory in bytes

Return type:

int

load_model()​

load_model(session)

Parameters:

session (InferenceSession)

Return type:

Model

model​

model: Model

Compiled and initialized model ready for inference.

model_config_cls​

model_config_cls

alias of MiniMaxM2Config

norm_method​

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method to use: 'rms_norm' or 'layer_norm'.

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare the inputs for the first pass in multistep execution.

Parameters:

Return type:

MiniMaxM2Inputs

state_dict​

state_dict: dict[str, Any]

Weights to load into the model.

MiniMaxM2ReasoningParser​

class max.pipelines.architectures.minimax_m2.MiniMaxM2ReasoningParser(think_start_token_id, think_end_token_id, tool_call_start_token_id=None)

Bases: ReasoningParser

MiniMax-M2 reasoning parser for sections framed by <think> and </think>.

Reasoning may end implicitly when a tool call begins (<minimax:tool_call>).

Reasoning may begin implicitly, without an explicit <think> token (the chat template appends <think> to the assistant turn).

Parameters:

  • think_start_token_id (int)
  • think_end_token_id (int)
  • tool_call_start_token_id (int | None)

from_tokenizer()​

async classmethod from_tokenizer(tokenizer)

Construct a reasoning parser from a tokenizer.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

MiniMaxM2ReasoningParser

reasoning_end_token_id()​

async classmethod reasoning_end_token_id(tokenizer)

Returns the </think> token id.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

int | None

stream()​

stream(delta_token_ids, is_currently_reasoning=True)

Identify a reasoning span within a streaming delta chunk.

When is_currently_reasoning=False and the chunk contains no <think> opener, returns an empty span so non-reasoning chunks (turns where the chat template prefilled </think>, or any chunk after reasoning ended in a prior chunk) aren’t misclassified as reasoning.

Parameters:

Return type:

ParsedReasoningDelta
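The chunk-level behavior described above can be sketched as follows. This is an illustration only: the token ids are made up, the `reasoning_span` helper is hypothetical, and the return value is a plain `((start, end), still_reasoning)` tuple rather than the real ParsedReasoningDelta, whose fields are not documented here.

```python
THINK_START, THINK_END, TOOL_CALL_START = 1, 2, 3  # hypothetical token ids


def reasoning_span(delta_token_ids, is_currently_reasoning=True):
    """Return ((start, end), still_reasoning) for one streamed chunk."""
    ids = list(delta_token_ids)
    start = 0
    if THINK_START in ids:
        start = ids.index(THINK_START) + 1  # reasoning starts after the opener
    elif not is_currently_reasoning:
        return (0, 0), False  # empty span: a plain content chunk
    for end in range(start, len(ids)):
        # An explicit </think> or the start of a tool call ends reasoning.
        if ids[end] in (THINK_END, TOOL_CALL_START):
            return (start, end), False
    return (start, len(ids)), True  # reasoning continues into the next chunk


print(reasoning_span([10, 11]))                                   # whole chunk is reasoning
print(reasoning_span([10, 2, 12]))                                # reasoning ends at </think>
print(reasoning_span([10, 11], is_currently_reasoning=False))     # empty span
print(reasoning_span([1, 10, 3, 12], is_currently_reasoning=False))  # implicit end at tool call
```

A caller would feed the returned `still_reasoning` flag back in as `is_currently_reasoning` for the next chunk.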

will_reason_after_prompt()​

will_reason_after_prompt(prompt_token_ids)

Predicts whether the model will emit reasoning after this prompt.

Only checks for </think>, not <minimax:tool_call>, because the chat template embeds tool-call format tokens in the system prompt when tools are provided, which must not disable reasoning for the generation that follows.

Parameters:

prompt_token_ids (Sequence[int])

Return type:

bool
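One way to picture this check is a reverse scan over the prompt for the most recent think marker. The sketch below is an assumption about the logic, not the actual implementation: the token ids and the `will_reason` helper are hypothetical, and returning True when no marker is found is a guess.

```python
THINK_START, THINK_END, TOOL_CALL_START = 1, 2, 3  # hypothetical token ids


def will_reason(prompt_token_ids):
    """Guess whether generation starts inside a reasoning section."""
    for tok in reversed(prompt_token_ids):
        if tok == THINK_END:
            return False  # the template already closed the reasoning section
        if tok == THINK_START:
            return True   # the template left an open <think>
        # TOOL_CALL_START is deliberately ignored: it can appear in the
        # system prompt as part of the tool-call format description.
    return True  # assumption: reason by default when no marker is present


print(will_reason([5, 3, 6, 1]))        # tool-call token in prompt, open <think>
print(will_reason([5, 1, 2]))           # prefilled <think></think>
print(will_reason([5, 1, 7, 2, 8, 1]))  # earlier turn closed, new turn open
```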

MinimaxM2ToolParser​

class max.pipelines.architectures.minimax_m2.MinimaxM2ToolParser

Bases: StructuralTagToolParser

Parses MiniMax M2-style tool calls from model responses.

MiniMax M2 wraps tool calls in <minimax:tool_call> and uses <invoke name=...> / </invoke> for individual calls. The base class drives buffer accumulation and section/call iteration; we customize the body split ("name"> is the header, the rest is parameter XML) and provide a structured-to-JSON conversion so the base’s argument diffing can produce monotonically-growing JSON.
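As a rough, non-streaming illustration of the format described above, the sketch below extracts calls from a complete response. It is not the base class's incremental algorithm. It assumes parameters are written as `<parameter name="...">value</parameter>` and keeps every value as a string; the `parse_tool_calls` helper and the sample text are hypothetical.

```python
import json
import re

SECTION_BEGIN, SECTION_END = "<minimax:tool_call>", "</minimax:tool_call>"
CALL_BEGIN, CALL_END = "<invoke name=", "</invoke>"
# Assumed parameter syntax; non-greedy so values may contain a literal "<".
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts from a finished response."""
    start = text.find(SECTION_BEGIN)
    if start == -1:
        return []
    start += len(SECTION_BEGIN)
    end = text.find(SECTION_END, start)
    section = text[start:] if end == -1 else text[start:end]
    calls = []
    for chunk in section.split(CALL_BEGIN)[1:]:
        body = chunk.split(CALL_END)[0]
        # Everything up to the first ">" is the header ("name"); the rest is parameter XML.
        header, _, params_xml = body.partition(">")
        name = header.strip().strip('"')
        args = {key: value.strip() for key, value in PARAM_RE.findall(params_xml)}
        calls.append({"name": name, "arguments": json.dumps(args)})
    return calls


text = (
    '<minimax:tool_call>\n<invoke name="get_weather">\n'
    '<parameter name="city">Paris</parameter>\n'
    '<parameter name="filter">temp < 30</parameter>\n'
    "</invoke>\n</minimax:tool_call>"
)
calls = parse_tool_calls(text)
print(calls)
```

A streaming parser would instead re-run the structured-to-JSON step on each buffered prefix and emit only the newly added suffix of the arguments string.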

CALL_BEGIN​

CALL_BEGIN: ClassVar[str] = '<invoke name='

CALL_END​

CALL_END: ClassVar[str] = '</invoke>'

SECTION_BEGIN​

SECTION_BEGIN: ClassVar[str] = '<minimax:tool_call>'

SECTION_END​

SECTION_END: ClassVar[str] = '</minimax:tool_call>'

generate_tool_call_grammar()​

static generate_tool_call_grammar(response_format_schema=None, tools=None, **kwargs)

Generates a Lark grammar for MiniMax M2 constrained decoding.

Uses Lark rules for the envelope structure and regex terminals only at leaves. Special tokens <minimax:tool_call> and </minimax:tool_call> (IDs 200052/200053) are referenced as single-token symbols to avoid llguidance ParserTooComplex on multi-byte literal matches.

Parameter values are captured via a param_value rule with suffix="</parameter>" so that the </parameter> boundary is handled atomically at the grammar level. This allows values to contain literal < characters (HTML, JSON, comparison operators, etc.) without restriction.

No trailing free-form text is allowed after the closing </minimax:tool_call> tag: under constrained decoding the model is expected to emit EOS immediately after the envelope.

When response_format_schema is provided, adds an alternative branch that accepts JSON content matching the schema.

Parameters:

  • response_format_schema (dict[str, Any] | None) – Optional JSON schema dict. When provided, the grammar also accepts a JSON response matching the schema.
  • tools (list[dict[str, Any]] | None) – Optional list of OpenAI-style tool dicts. None accepts any tool name.
  • **kwargs (Any) – Ignored; accepts tokenizer and other future kwargs.

Returns:

A grammar string compatible with LLMatcher.

Return type:

str
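A possible call, based only on the signature documented above, might look like the following. The `get_weather` tool is a made-up example in the common OpenAI function-tool layout, and the import is guarded so the snippet still runs where MAX is not installed; the contents of the returned grammar are not shown because they are not specified on this page.

```python
# Hypothetical OpenAI-style tool definition used only for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

try:
    from max.pipelines.architectures.minimax_m2 import MinimaxM2ToolParser
except ImportError:
    MinimaxM2ToolParser = None  # MAX is not available in this environment

grammar = None
if MinimaxM2ToolParser is not None:
    # Static method: no parser instance is needed.
    grammar = MinimaxM2ToolParser.generate_tool_call_grammar(tools=tools)

print("grammar generated" if grammar is not None else "MAX not installed; skipped")
```

Passing `tools=None` instead would, per the parameter description, accept any tool name.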