
Python module

max.pipelines.architectures.step3p5

Step3p5Config​

class max.pipelines.architectures.step3p5.Step3p5Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, target_layer_ids=None, use_subgraphs=True, data_parallel_degree=1, sliding_window=512, num_attention_groups=8, head_dim=128, layer_types=<factory>, sliding_num_attention_heads=96, sliding_num_attention_groups=8, per_layer_rope_theta=<factory>, partial_rotary_factors=<factory>, yarn_only_types=<factory>, use_head_wise_attn_gate=True, moe_num_experts=288, moe_top_k=8, moe_intermediate_size=1280, share_expert_dim=1280, moe_layers=<factory>, moe_router_scaling_factor=3.0, norm_expert_weight=True, swiglu_limits=<factory>, swiglu_limits_shared=<factory>)

Bases: Llama3Config

Model configuration for Step-3.5-Flash.

calculate_attention_multiplier()​

static calculate_attention_multiplier(huggingface_config)

Compute the attention scale for Step-3.5.

Parameters:

huggingface_config (AutoConfig) – The HuggingFace configuration object.

Returns:

The attention multiplier value.

Return type:

float

construct_kv_params()​

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Construct KV cache parameters for Step-3.5.

Uses the maximum number of KV heads across all layer types, since the KV cache is allocated per-layer and sliding layers may have more KV heads than full attention layers.

Parameters:

  • huggingface_config (AutoConfig) – The HuggingFace configuration object.
  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • devices (list[DeviceRef]) – Devices to use for the KV cache.
  • kv_cache_config (KVCacheConfig) – Configuration for KV cache.
  • cache_dtype (DType) – Data type for the cache.

Returns:

KVCacheParams object.

Return type:

KVCacheParams
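
As a rough illustration of the "maximum across layer types" rule described for construct_kv_params(), the following sketch picks the KV head count to size the cache with. The helper name max_kv_heads and its arguments are hypothetical and are not part of the MAX API; the real method works from the HuggingFace and pipeline configuration objects.

```python
def max_kv_heads(layer_types, full_kv_heads, sliding_kv_heads):
    """Return the largest KV head count needed by any layer type present.

    Hypothetical helper: the argument names are illustrative only.
    """
    per_type = {
        "full_attention": full_kv_heads,
        "sliding_attention": sliding_kv_heads,
    }
    # Only consider layer types that actually occur in the model.
    return max(per_type[t] for t in set(layer_types))


layer_types = ["full_attention", "sliding_attention", "sliding_attention"]
kv_heads = max_kv_heads(layer_types, full_kv_heads=8, sliding_kv_heads=8)
```

Sizing every layer for the largest head count keeps the per-layer cache shape uniform, at the cost of unused space in layers that need fewer heads.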

head_dim​

head_dim: int = 128

Dimension of each attention head.

initialize()​

classmethod initialize(pipeline_config, model_config=None)

Initializes a Step3p5Config instance from pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • model_config (MAXModelConfig | None) – Optional MAX model configuration override.

Returns:

An initialized Step3p5Config instance.

Return type:

Self

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

Initializes a Step3p5Config instance from pipeline and HuggingFace configs.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – The HuggingFace model configuration.
  • model_config (MAXModelConfig | None) – Optional MAX model configuration override.

Returns:

An initialized Step3p5Config instance.

Return type:

Self

layer_types​

layer_types: list[str]

Per-layer attention type: 'full_attention' or 'sliding_attention'.

moe_intermediate_size​

moe_intermediate_size: int = 1280

Intermediate dimension of each MoE expert MLP.

moe_layers​

moe_layers: set[int]

Set of layer indices that use MoE (vs dense MLP).

moe_num_experts​

moe_num_experts: int = 288

Number of routed experts in MoE layers.

moe_router_scaling_factor​

moe_router_scaling_factor: float = 3.0

Scaling factor applied to routed expert weights.

moe_top_k​

moe_top_k: int = 8

Number of experts activated per token.

norm_expert_weight​

norm_expert_weight: bool = True

Whether to normalize top-k expert weights to sum to 1.
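
The interaction of moe_top_k, norm_expert_weight, and moe_router_scaling_factor can be sketched for a single token as below. This is a minimal illustration, not the MAX implementation: the function route_token is hypothetical, the router scores are assumed to be positive (for example, the output of a sigmoid or softmax), and applying the scaling factor after normalization is an assumption.

```python
def route_token(scores, top_k, scaling_factor, normalize):
    """Pick the top_k experts for one token and return {expert_index: weight}."""
    # Indices of the top_k highest-scoring experts.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:top_k]
    weights = [scores[i] for i in top]
    if normalize:
        # norm_expert_weight: rescale the selected weights so they sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    # Assumption: the router scaling factor is applied after normalization.
    return {i: w * scaling_factor for i, w in zip(top, weights)}


scores = [0.1, 0.4, 0.2, 0.3]  # illustrative router scores for 4 experts
routed = route_token(scores, top_k=2, scaling_factor=3.0, normalize=True)
```

With normalization enabled, the routed weights for a token sum to the scaling factor regardless of the raw score magnitudes.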

num_attention_groups​

num_attention_groups: int = 8

Number of KV head groups (equal to num_key_value_heads for full-attention layers).

partial_rotary_factors​

partial_rotary_factors: list[float]

Per-layer partial rotary factors (0.5 for full-attention layers, 1.0 for sliding-attention layers).
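
Assuming the usual meaning of a partial rotary factor (the fraction of each head's dimensions that receives rotary position embeddings), the rotated width per head can be worked out as follows. The helper rotary_dim is hypothetical and only illustrates the arithmetic.

```python
def rotary_dim(head_dim, partial_rotary_factor):
    """Number of dimensions per head that receive rotary embeddings.

    Assumption: the factor is the rotated fraction of head_dim.
    """
    return int(head_dim * partial_rotary_factor)


full_attention_dim = rotary_dim(head_dim=128, partial_rotary_factor=0.5)
sliding_attention_dim = rotary_dim(head_dim=128, partial_rotary_factor=1.0)
```

Under that assumption and the default head_dim of 128, full-attention layers rotate half of each head and sliding-attention layers rotate all of it.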

per_layer_rope_theta​

per_layer_rope_theta: list[float]

Per-layer RoPE theta values. If empty, uses a single rope_theta.
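
The fallback described for per_layer_rope_theta can be sketched as a small lookup. The function name and the theta values below are illustrative placeholders, not values taken from the model.

```python
def rope_theta_for_layer(layer_idx, per_layer_rope_theta, rope_theta):
    """Return the RoPE theta for a layer, falling back to the global value."""
    if per_layer_rope_theta:
        return per_layer_rope_theta[layer_idx]
    # Empty list: every layer shares the single rope_theta.
    return rope_theta


# Placeholder values chosen only to show the two code paths.
per_layer = rope_theta_for_layer(1, [10000.0, 500000.0], rope_theta=10000.0)
shared = rope_theta_for_layer(1, [], rope_theta=10000.0)
```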

share_expert_dim​

share_expert_dim: int = 1280

Intermediate dimension of the shared expert MLP.

sliding_num_attention_groups​

sliding_num_attention_groups: int = 8

Number of KV head groups for sliding attention layers.

sliding_num_attention_heads​

sliding_num_attention_heads: int = 96

Number of attention heads for sliding attention layers.
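
With the defaults above (96 sliding attention heads, 8 sliding KV groups), each KV group would serve 12 query heads. The sketch below shows that mapping under the assumption of contiguous grouping, which is the common grouped-query attention layout; the helper is hypothetical.

```python
def kv_group_for_query_head(q_head, num_heads, num_groups):
    """Map a query head index to the KV group it reads from.

    Assumption: query heads are assigned to groups in contiguous blocks.
    """
    heads_per_group = num_heads // num_groups
    return q_head // heads_per_group


heads_per_group = 96 // 8
first_group = kv_group_for_query_head(11, num_heads=96, num_groups=8)
second_group = kv_group_for_query_head(12, num_heads=96, num_groups=8)
last_group = kv_group_for_query_head(95, num_heads=96, num_groups=8)
```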

sliding_window​

sliding_window: int = 512

Sliding window size for local attention layers.
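
A sliding-window attention mask can be illustrated with plain lists. This sketch assumes causal attention and that the window counts the current token; whether the MAX kernels use exactly this boundary convention is not stated here.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        # Causal (k <= q) and within the last `window` positions.
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=3)
```

For the last query position in this toy example, only the three most recent positions are visible, which is what bounds the per-layer KV footprint for local attention layers.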

swiglu_limits​

swiglu_limits: list[float]

Per-layer SwiGLU activation clipping thresholds for routed experts. 0.0 means no clipping. Non-zero values clamp intermediate activations.
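
The effect of a SwiGLU limit can be sketched on scalar inputs. Exactly which intermediate values are clamped is an assumption in this example: the activated gate is clamped from above and the up projection is clamped symmetrically. The function names are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clipped_swiglu(gate, up, limit):
    """SwiGLU on scalars with optional clipping; limit == 0.0 disables it."""
    act = silu(gate)
    if limit != 0.0:
        # Assumption: clamp the activated gate from above and the up
        # projection to [-limit, limit].
        act = min(act, limit)
        up = max(-limit, min(up, limit))
    return act * up


unclipped = clipped_swiglu(10.0, 20.0, limit=0.0)
clipped = clipped_swiglu(10.0, 20.0, limit=7.0)
```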

swiglu_limits_shared​

swiglu_limits_shared: list[float]

Per-layer SwiGLU activation clipping thresholds for shared experts.

use_head_wise_attn_gate​

use_head_wise_attn_gate: bool = True

Whether to use per-head sigmoid attention gating (g_proj).
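
Per-head sigmoid gating can be pictured as scaling each head's output by a value between 0 and 1. The sketch below assumes one scalar gate logit per head per token (as would come from a projection such as g_proj); the function gate_heads is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate logit) for that head."""
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(head_outputs, gate_logits)
    ]


# Two heads with 2-dimensional outputs; a logit of 0.0 halves the head output.
gated = gate_heads([[2.0, 4.0], [1.0, 1.0]], gate_logits=[0.0, 100.0])
```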

yarn_only_types​

yarn_only_types: list[str]

Layer types that use rope_scaling (e.g. ['full_attention']).
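
Combined with layer_types, this field selects which layers apply RoPE scaling. The sketch below uses a made-up four-layer pattern and a hypothetical helper to show the selection.

```python
def layers_with_rope_scaling(layer_types, yarn_only_types):
    """Return indices of layers whose type is listed in yarn_only_types."""
    allowed = set(yarn_only_types)
    return [i for i, t in enumerate(layer_types) if t in allowed]


# Illustrative layer pattern, not the real model layout.
layer_types = [
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
]
scaled_layers = layers_with_rope_scaling(layer_types, ["full_attention"])
```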

Step3p5Inputs​

class max.pipelines.architectures.step3p5.Step3p5Inputs(tokens, input_row_offsets, signal_buffers, return_n_logits, data_parallel_splits=None, host_input_row_offsets=None, ep_inputs=<factory>, *, kv_cache_inputs=None, lora=None, hidden_states=None)

Bases: Llama3Inputs

Inputs for Step-3.5 in TP+EP and DP+EP modes.

Extends Llama3Inputs with optional host_input_row_offsets / data_parallel_splits (DP+EP only) and the EP communication buffers (TP+EP and DP+EP).

buffers​

property buffers: tuple[Buffer, ...]

Returns positional Buffer inputs for model ABI calls.

ep_inputs​

ep_inputs: tuple[Buffer, ...]

host_input_row_offsets​

host_input_row_offsets: Buffer | None = None

Step3p5Model​

class max.pipelines.architectures.step3p5.Step3p5Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: AlwaysSignalBuffersMixin, LlamaModelBase

Step-3.5-Flash pipeline model.

Supports single-GPU, multi-GPU TP, TP-attention + EP-MoE, and DP-attention + EP-MoE.

attention_bias​

attention_bias: bool = False

Whether to use attention bias.

load_model()​

load_model(session)

Parameters:

session (InferenceSession)

Return type:

Model

model​

model: Model

Compiled and initialized model ready for inference.

model_config_cls​

model_config_cls

alias of Step3p5Config

norm_method​

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method: 'rms_norm' or 'layer_norm'.

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare the inputs for the first pass in multistep execution.

Return type:

Llama3Inputs | Step3p5Inputs

state_dict​

state_dict: dict[str, Any]

Weights to load into the model.

Step3p5PretrainedConfig​

class max.pipelines.architectures.step3p5.Step3p5PretrainedConfig(**kwargs)

Bases: PreTrainedConfig

Custom PretrainedConfig for Step-3.5 so AutoConfig.from_pretrained() works.

This is the primary location for mapping Step-3.5 field names to the standard HuggingFace fields that Llama3Config expects. A subset of these aliases is also applied in Step3p5Config._ensure_hf_config_aliases() as a fallback when trust_remote_code=True loads the repo's own config class instead of this one.
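
The idea of field aliasing can be sketched on a plain dictionary. The alias table below is illustrative: it contains only the one correspondence stated on this page (num_attention_groups matches num_key_value_heads for full-attention layers), and the real mapping in Step3p5PretrainedConfig is not reproduced here. The helper apply_aliases is hypothetical, and leaving an already-present target field untouched is a design assumption.

```python
# Illustrative alias table: Step-3.5 field name -> standard field name.
ALIASES = {"num_attention_groups": "num_key_value_heads"}


def apply_aliases(fields, aliases):
    """Copy aliased fields to their standard names without overwriting."""
    out = dict(fields)
    for src, dst in aliases.items():
        # Assumption: an explicitly set standard field takes precedence.
        if src in out and dst not in out:
            out[dst] = out[src]
    return out


normalized = apply_aliases({"num_attention_groups": 8}, ALIASES)
```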

Parameters:

kwargs (object)

model_type​

model_type: str = 'step3p5'
