Python module

max.pipelines.architectures.minimax_m2

MiniMaxM2Config

class max.pipelines.architectures.minimax_m2.MiniMaxM2Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, use_subgraphs=True, data_parallel_degree=1, num_local_experts=256, num_experts_per_tok=8, norm_topk_prob=True, correction_bias_dtype=None, gate_dtype=None, attn_dtype=None, ep_config=None, partial_rotary_factor=1.0)

Bases: Llama3Config

Configuration for MiniMax-M2 MoE models.

Extends Llama3Config with MoE-specific parameters, including those for sigmoid routing with an expert score correction bias.
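As a rough illustration of sigmoid routing with a score correction bias, the sketch below scores each expert with a sigmoid, adds a per-expert bias, and keeps the highest-ranked experts. It is not the MAX implementation: `route_token` and its arguments are hypothetical names, and the choice to use the bias only for selection (while the returned weights come from the unbiased scores) is an assumption.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, correction_bias, top_k):
    """Pick top_k experts for one token (hypothetical helper).

    Assumption: the bias only affects which experts are selected; the
    returned weights are the unbiased sigmoid scores of those experts.
    """
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    # Rank experts by biased score, highest first, and keep top_k of them.
    ranked = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [scores[i] for i in chosen]


logits = [2.0, 0.0, -1.0, 1.0]
bias = [0.0, 0.0, 0.9, 0.0]  # nudges expert 2 into the selection
experts, weights = route_token(logits, bias, top_k=2)
unbiased_experts, _ = route_token(logits, [0.0] * 4, top_k=2)
```

In this toy input, the bias changes which experts are chosen but not the weight attached to each chosen expert.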

attn_dtype

attn_dtype: DType | None = None

Data type for attention weights. Detected from the state dict during finalize().

calculate_attention_multiplier()

static calculate_attention_multiplier(huggingface_config)

Computes the attention multiplier for MiniMax-M2 models.

Uses the explicit head_dim from the config.

Parameters:

huggingface_config (AutoConfig) – The HuggingFace configuration object.

Returns:

The attention multiplier value.

Return type:

float
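The sketch below shows why reading head_dim explicitly can matter. It assumes the multiplier is the usual 1/sqrt(head_dim) attention scale, which this page does not state. The config object is a stand-in built with `SimpleNamespace`, and its numeric values are illustrative rather than taken from a real checkpoint.

```python
import math
from types import SimpleNamespace


def attention_multiplier(config) -> float:
    # Assumption: standard scaled dot-product scaling, 1 / sqrt(head_dim),
    # using the explicit head_dim field instead of deriving it.
    return 1.0 / math.sqrt(config.head_dim)


# Illustrative values: head_dim differs from hidden_size // num_attention_heads.
cfg = SimpleNamespace(hidden_size=3072, num_attention_heads=48, head_dim=128)

explicit = attention_multiplier(cfg)
derived = 1.0 / math.sqrt(cfg.hidden_size // cfg.num_attention_heads)
```

With these stand-in values the two results differ, so deriving head_dim from hidden_size would give the wrong scale whenever the config sets head_dim independently.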

construct_kv_params()

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Constructs KV cache parameters using explicit head_dim from config.

Parameters:

  • huggingface_config (AutoConfig) – The HuggingFace configuration object.
  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • devices (list[DeviceRef]) – Devices to use for the KV cache.
  • kv_cache_config (KVCacheConfig) – Configuration for KV cache.
  • cache_dtype (DType) – Data type for the cache.

Returns:

KVCacheParams object with the correct head_dim from config.

Return type:

KVCacheParams
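To make the effect of head_dim on cache sizing concrete, here is a small sketch that estimates KV cache bytes per token. `resolve_head_dim` and `kv_bytes_per_token` are hypothetical helpers, not part of MAX, and the config values are illustrative stand-ins.

```python
from types import SimpleNamespace


def resolve_head_dim(config) -> int:
    # Prefer an explicit head_dim; otherwise fall back to the common derivation.
    explicit = getattr(config, "head_dim", None)
    if explicit is not None:
        return explicit
    return config.hidden_size // config.num_attention_heads


def kv_bytes_per_token(config, dtype_bytes: int) -> int:
    # Two cached tensors (K and V) per layer, each num_key_value_heads x head_dim.
    return (
        2
        * config.num_hidden_layers
        * config.num_key_value_heads
        * resolve_head_dim(config)
        * dtype_bytes
    )


# Illustrative configs; only the presence of head_dim differs.
with_head_dim = SimpleNamespace(
    hidden_size=3072, num_attention_heads=48, num_key_value_heads=8,
    num_hidden_layers=4, head_dim=128,
)
without_head_dim = SimpleNamespace(
    hidden_size=3072, num_attention_heads=48, num_key_value_heads=8,
    num_hidden_layers=4,
)

explicit_bytes = kv_bytes_per_token(with_head_dim, dtype_bytes=2)
derived_bytes = kv_bytes_per_token(without_head_dim, dtype_bytes=2)
```

In this example the derived head_dim would size the cache at half of what the explicit value requires.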

correction_bias_dtype

correction_bias_dtype: DType | None = None

Data type of the e_score_correction_bias weight. Detected from the state dict during finalize().
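One way such detection could work is to scan the state dict for a weight whose name ends with a known suffix and read its dtype. The sketch below is only an illustration: `detect_dtype` is a hypothetical helper, the weight names are made up, and dtypes are plain strings rather than MAX DType values.

```python
from types import SimpleNamespace


def detect_dtype(state_dict, suffix):
    """Return the dtype of the first weight whose name ends with suffix.

    Returns None when no weight matches, mirroring the None default of the
    config fields.
    """
    for name, weight in state_dict.items():
        if name.endswith(suffix):
            return weight.dtype
    return None


# Hypothetical weight names; real checkpoints may use different keys.
state_dict = {
    "layers.0.self_attn.q_proj.weight": SimpleNamespace(dtype="bfloat16"),
    "layers.0.mlp.gate.weight": SimpleNamespace(dtype="float32"),
    "layers.0.mlp.e_score_correction_bias": SimpleNamespace(dtype="float32"),
}

bias_dtype = detect_dtype(state_dict, "e_score_correction_bias")
missing_dtype = detect_dtype(state_dict, "does_not_exist")
```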

ep_config

ep_config: EPConfig | None = None

Expert parallelism (EP) configuration. None disables EP (single-GPU execution).
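For intuition, expert parallelism places different experts on different devices. The sketch below splits experts into equal contiguous blocks per rank. The contiguous layout and the `experts_for_rank` helper are assumptions for illustration; EPConfig may assign experts differently.

```python
def experts_for_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Return the expert indices owned by one rank (hypothetical layout)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank out of range")
    per_rank = num_experts // ep_size
    start = rank * per_rank
    return range(start, start + per_rank)


# 256 experts (the num_local_experts default) spread over 8 devices.
owned = experts_for_rank(num_experts=256, ep_size=8, rank=3)
single_gpu = experts_for_rank(num_experts=256, ep_size=1, rank=0)
```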

gate_dtype

gate_dtype: DType | None = None

Data type for the gate linear layer. Detected from the state dict during finalize().

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initializes a MiniMaxM2Config from pipeline configuration.

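Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • model_config (MAXModelConfig | None) – The MAX Engine model configuration.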

Returns:

An initialized MiniMaxM2Config instance.

Return type:

Self

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

Initializes a MiniMaxM2Config from pipeline and HuggingFace configs.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – The HuggingFace model configuration.
  • model_config (MAXModelConfig | None) – The MAX Engine model configuration.

Returns:

An initialized MiniMaxM2Config instance.

Return type:

Self

norm_topk_prob

norm_topk_prob: bool = True

Whether to normalize top-k expert probabilities to sum to 1.
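Normalizing the top-k weights amounts to dividing each selected expert's weight by their sum. The sketch below shows that step in isolation; `finalize_weights` is a hypothetical helper and the input weights are arbitrary.

```python
def finalize_weights(topk_weights, norm_topk_prob: bool):
    """Optionally rescale the selected experts' weights so they sum to 1."""
    if not norm_topk_prob:
        return list(topk_weights)
    total = sum(topk_weights)
    return [w / total for w in topk_weights]


raw = [0.6, 0.3, 0.3]  # weights of the selected experts for one token
normalized = finalize_weights(raw, norm_topk_prob=True)
untouched = finalize_weights(raw, norm_topk_prob=False)
```

The relative ordering of the experts is unchanged; only the scale of the combined expert output differs.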

num_experts_per_tok

num_experts_per_tok: int = 8

Number of experts selected per token.

num_local_experts

num_local_experts: int = 256

Number of local experts in each MoE layer.

partial_rotary_factor

partial_rotary_factor: float = 1.0

Fraction of head_dim used for rotary embeddings. For MiniMax-M2: rotary_dim/head_dim = 64/128 = 0.5.
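A partial rotary embedding rotates only the first rotary_dim elements of each head and passes the rest through unchanged. The sketch below is a plain-Python illustration, not the MAX kernel. The split-half pairing of elements and the `theta` default are assumptions, and `apply_partial_rope` is a hypothetical name.

```python
import math


def apply_partial_rope(vec, position, partial_rotary_factor, theta=10000.0):
    """Rotate the first rotary_dim elements of one head vector.

    Assumption: element i is paired with element i + rotary_dim // 2
    (the split-half convention); other layouts pair adjacent elements.
    """
    head_dim = len(vec)
    rotary_dim = int(head_dim * partial_rotary_factor)
    half = rotary_dim // 2
    out = list(vec)  # elements past rotary_dim are copied through as-is
    for i in range(half):
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out


vec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_partial_rope(vec, position=3, partial_rotary_factor=0.5)
```

With a factor of 0.5 and a head of 8 elements, only the first 4 elements change, and their combined norm is preserved because each pair undergoes a pure rotation.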

MiniMaxM2Inputs

class max.pipelines.architectures.minimax_m2.MiniMaxM2Inputs(tokens, input_row_offsets, signal_buffers, return_n_logits, lora_grouped_offsets=None, num_active_loras=None, lora_end_idx=None, batch_seq_len=None, lora_ids_kv=None, lora_grouped_offsets_kv=None, data_parallel_splits=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, ep_inputs=(), host_input_row_offsets=None)

Bases: Llama3Inputs

Inputs for MiniMax-M2 with expert-parallel (EP) and data-parallel (DP) support.

buffers

property buffers: tuple[Buffer, ...]

Returns positional Buffer inputs for model ABI calls.

ep_inputs

ep_inputs: tuple[Buffer, ...] = ()

host_input_row_offsets

host_input_row_offsets: Buffer | None = None

MiniMaxM2Model

class max.pipelines.architectures.minimax_m2.MiniMaxM2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: AlwaysSignalBuffersMixin, LlamaModelBase

MiniMax-M2 pipeline model for text generation.

Uses AlwaysSignalBuffersMixin since VocabParallelEmbedding and ColumnParallelLinear always require signal buffers for allreduce.

attention_bias

attention_bias: bool = False

Whether to use attention bias.

get_kv_params()

classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Returns the KV cache params for the pipeline model.

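Parameters:

  • huggingface_config (AutoConfig) – The HuggingFace configuration object.
  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • devices (list[DeviceRef]) – Devices to use for the KV cache.
  • kv_cache_config (KVCacheConfig) – Configuration for KV cache.
  • cache_dtype (DType) – Data type for the cache.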

Return type:

KVCacheParams

load_model()

load_model(session)

Parameters:

session (InferenceSession)

Return type:

Model

model

model: Model

Compiled and initialized model ready for inference.

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method to use: 'rms_norm' or 'layer_norm'.

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare the inputs for the first pass in multistep execution.

Return type:

MiniMaxM2Inputs

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepare the inputs for the next token in multistep execution. This should avoid any device synchronization or copy operations.

Return type:

MiniMaxM2Inputs
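To illustrate how inputs differ between the first pass and later steps, the sketch below builds row offsets for a ragged batch. It assumes input_row_offsets follows the common convention of cumulative sequence lengths with a leading zero, which this page does not spell out. `row_offsets` is a hypothetical helper that runs on plain Python lists, not device buffers.

```python
from itertools import accumulate


def row_offsets(lengths):
    """Cumulative offsets marking where each sequence starts in a flat batch."""
    return [0, *accumulate(lengths)]


# First pass: every sequence contributes its whole prompt.
prompt_lengths = [5, 3, 7]
initial_offsets = row_offsets(prompt_lengths)

# Later steps: every sequence contributes exactly one new token, so the
# offsets depend only on the batch size and need no per-step recomputation.
next_offsets = row_offsets([1] * len(prompt_lengths))
```

Because the next-step offsets depend only on the batch size, they are the kind of input that could be prepared once and reused, which fits the goal of avoiding synchronization and copies.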

state_dict

state_dict: dict[str, Any]

Weights to load into the model.

MiniMaxM2ReasoningParser

class max.pipelines.architectures.minimax_m2.MiniMaxM2ReasoningParser(think_start_token_id, think_end_token_id, tool_call_start_token_id=None)

Bases: ReasoningParser

MiniMax-M2 reasoning parser for <think>…</think> sections.

Reasoning may end implicitly when a tool call begins (minimax:tool_call).

Reasoning may begin implicitly, without an explicit think-start token in the output (the chat template appends it to the assistant turn).

Parameters:

  • think_start_token_id (int)
  • think_end_token_id (int)
  • tool_call_start_token_id (int | None)
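The two implicit cases described above can be sketched as a one-shot split over a complete list of token ids. This is a toy illustration, not the parser's actual logic. `split_reasoning` is a hypothetical function, the token ids are made up, and leaving the tool-call token in the non-reasoning part is an assumption.

```python
def split_reasoning(token_ids, think_start, think_end, tool_call_start=None):
    """Split token ids into (reasoning_ids, remaining_ids)."""
    ids = list(token_ids)
    # Skip an explicit start token if present; otherwise reasoning is assumed
    # to begin implicitly at the first token.
    begin = 1 if ids and ids[0] == think_start else 0
    for i in range(begin, len(ids)):
        if ids[i] == think_end:
            # Explicit end: the end token itself is dropped from both parts.
            return ids[begin:i], ids[i + 1:]
        if tool_call_start is not None and ids[i] == tool_call_start:
            # Implicit end: the tool-call token stays with the remaining output.
            return ids[begin:i], ids[i:]
    # No end marker seen: everything so far is reasoning.
    return ids[begin:], []


THINK_START, THINK_END, TOOL_CALL = 1, 2, 3  # made-up token ids

implicit_start = split_reasoning([10, 11, THINK_END, 20], THINK_START, THINK_END, TOOL_CALL)
explicit_start = split_reasoning([THINK_START, 10, THINK_END, 20], THINK_START, THINK_END, TOOL_CALL)
tool_call_end = split_reasoning([10, TOOL_CALL, 30], THINK_START, THINK_END, TOOL_CALL)
unterminated = split_reasoning([10, 11], THINK_START, THINK_END, TOOL_CALL)
```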

from_tokenizer()

async classmethod from_tokenizer(tokenizer)

Construct a reasoning parser from a tokenizer.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

MiniMaxM2ReasoningParser

stream()

stream(delta_token_ids)

Identify a reasoning span within a streaming delta chunk.

Parameters:

delta_token_ids (Sequence[int])

Return type:

tuple[ReasoningSpan, bool]
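Streaming differs from one-shot parsing in that state must persist across chunks. The toy class below shows one way to track that state. It is not the MAX implementation: `ToyStreamingParser` is hypothetical, the span is modeled as a half-open `(start, stop)` index pair rather than a ReasoningSpan, and treating the bool as "reasoning has finished" is an assumption about the return value.

```python
class ToyStreamingParser:
    """Tracks whether reasoning has ended across successive token chunks."""

    def __init__(self, think_end, tool_call_start=None):
        self.think_end = think_end
        self.tool_call_start = tool_call_start
        self.done = False

    def feed(self, delta_token_ids):
        """Return ((start, stop), done) for one chunk.

        (start, stop) is the half-open index range of reasoning tokens
        within this chunk; done reports whether reasoning has ended.
        """
        if self.done:
            return (0, 0), True
        for i, tok in enumerate(delta_token_ids):
            ends_here = tok == self.think_end or (
                self.tool_call_start is not None and tok == self.tool_call_start
            )
            if ends_here:
                self.done = True
                return (0, i), True
        # No end marker in this chunk: the whole chunk is reasoning.
        return (0, len(delta_token_ids)), False


parser = ToyStreamingParser(think_end=2, tool_call_start=3)
first = parser.feed([10, 11])
second = parser.feed([12, 2, 20])
third = parser.feed([21])
```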