Python module

max.pipelines.architectures.step3p5

Step3p5Config

class max.pipelines.architectures.step3p5.Step3p5Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, target_layer_ids=None, use_subgraphs=True, data_parallel_degree=1, sliding_window=512, num_attention_groups=8, head_dim=128, layer_types=<factory>, sliding_num_attention_heads=96, sliding_num_attention_groups=8, per_layer_rope_theta=<factory>, partial_rotary_factors=<factory>, yarn_only_types=<factory>, use_head_wise_attn_gate=True, moe_num_experts=288, moe_top_k=8, moe_intermediate_size=1280, share_expert_dim=1280, moe_layers=<factory>, moe_router_scaling_factor=3.0, norm_expert_weight=True, swiglu_limits=<factory>, swiglu_limits_shared=<factory>)

source

Bases: Llama3Config

Model configuration for Step-3.5-Flash.

calculate_attention_multiplier()

static calculate_attention_multiplier(huggingface_config)

source

Compute the attention scale for Step-3.5.

Parameters:

huggingface_config (AutoConfig) – The HuggingFace configuration object.

Returns:

The attention multiplier value.

Return type:

float
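The reference does not state how the attention multiplier is derived. As a point of comparison only, the sketch below shows the conventional scaled dot-product attention scale, 1 / sqrt(head_dim). The function name default_attention_scale is hypothetical, and Step-3.5 may compute its multiplier differently.

```python
import math


def default_attention_scale(head_dim: int) -> float:
    """Conventional attention scale, 1 / sqrt(head_dim).

    Assumption: this is the textbook formula, not necessarily what
    Step3p5Config.calculate_attention_multiplier() returns.
    """
    if head_dim <= 0:
        raise ValueError("head_dim must be positive")
    return 1.0 / math.sqrt(head_dim)


# head_dim defaults to 128 in Step3p5Config.
scale = default_attention_scale(128)
```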

construct_kv_params()

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

source

Construct KV cache parameters for Step-3.5.

Uses the maximum number of KV heads across all layer types, because the KV cache is allocated per layer and sliding attention layers may have more KV heads than full attention layers.

Parameters:

  • huggingface_config (AutoConfig) – The HuggingFace configuration object.
  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • devices (list[DeviceRef]) – Devices to use for the KV cache.
  • kv_cache_config (KVCacheConfig) – Configuration for KV cache.
  • cache_dtype (DType) – Data type for the cache.

Returns:

The constructed KVCacheParams object.

Return type:

KVCacheParams
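The head-count selection described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name kv_heads_for_cache is hypothetical, and it assumes the only two layer types are full and sliding attention.

```python
def kv_heads_for_cache(
    num_attention_groups: int,
    sliding_num_attention_groups: int,
) -> int:
    """Pick one KV head count that is large enough for every layer type.

    Assumption: full attention layers use num_attention_groups KV heads
    and sliding attention layers use sliding_num_attention_groups.
    """
    return max(num_attention_groups, sliding_num_attention_groups)


# With the documented defaults both layer types use 8 KV head groups.
n_kv_heads = kv_heads_for_cache(8, 8)
```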

head_dim

head_dim: int = 128

source

Dimension of each attention head.

initialize()

classmethod initialize(pipeline_config, model_config=None)

source

Initializes a Step3p5Config instance from pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • model_config (MAXModelConfig | None) – Optional MAX model configuration override.

Returns:

An initialized Step3p5Config instance.

Return type:

Self

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

source

Initializes a Step3p5Config instance from pipeline and HuggingFace configs.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – The HuggingFace model configuration.
  • model_config (MAXModelConfig | None) – Optional MAX model configuration override.

Returns:

An initialized Step3p5Config instance.

Return type:

Self

layer_types

layer_types: list[str]

source

Per-layer attention type: ‘full_attention’ or ‘sliding_attention’.

moe_intermediate_size

moe_intermediate_size: int = 1280

source

Intermediate dimension of each MoE expert MLP.

moe_layers

moe_layers: set[int]

source

Set of layer indices that use MoE rather than a dense MLP.
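A minimal sketch of how such a set could drive the choice of feed-forward block per layer. The helper mlp_kind and the example indices are hypothetical. The real moe_layers value comes from the model configuration.

```python
def mlp_kind(layer_idx: int, moe_layers: set[int]) -> str:
    """Return which feed-forward block a layer would use."""
    return "moe" if layer_idx in moe_layers else "dense"


# Hypothetical 6-layer model whose last three layers use MoE.
example_moe_layers = {3, 4, 5}
kinds = [mlp_kind(i, example_moe_layers) for i in range(6)]
```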

moe_num_experts

moe_num_experts: int = 288

source

Number of routed experts in MoE layers.

moe_router_scaling_factor

moe_router_scaling_factor: float = 3.0

source

Scaling factor applied to routed expert weights.

moe_top_k

moe_top_k: int = 8

source

Number of experts activated per token.

norm_expert_weight

norm_expert_weight: bool = True

source

Whether to normalize top-k expert weights to sum to 1.
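The routing fields moe_top_k, norm_expert_weight, and moe_router_scaling_factor can be read together as in the sketch below. It is an illustration under stated assumptions: the router scores are taken as given and positive (the scoring function is not described here), and normalization is applied before scaling. The function route_token is hypothetical.

```python
def route_token(
    scores: list[float],
    top_k: int,
    norm_expert_weight: bool = True,
    scaling_factor: float = 3.0,
) -> dict[int, float]:
    """Map each selected expert index to its routing weight for one token."""
    if not 0 < top_k <= len(scores):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Indices of the top_k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = [scores[i] for i in top]
    if norm_expert_weight:
        total = sum(weights)
        if total > 0:
            # Normalize the selected weights so they sum to 1.
            weights = [w / total for w in weights]
    # Assumption: the scaling factor multiplies the (normalized) weights.
    return {i: w * scaling_factor for i, w in zip(top, weights)}


routed = route_token([0.1, 0.4, 0.2, 0.3], top_k=2)
```

With normalization enabled, the returned weights sum to scaling_factor rather than to 1.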

num_attention_groups

num_attention_groups: int = 8

source

Number of KV head groups (the same as num_key_value_heads for full attention layers).

partial_rotary_factors

partial_rotary_factors: list[float]

source

Per-layer partial rotary factors (0.5 for full attention layers, 1.0 for sliding attention layers).
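A partial rotary factor is commonly interpreted as the fraction of each head's dimensions that receive rotary position embeddings. The sketch below applies that interpretation to the documented head_dim of 128. The helper rotary_dim and the truncation to int are assumptions, not the library's code.

```python
def rotary_dim(head_dim: int, partial_rotary_factor: float) -> int:
    """Number of head dimensions that receive rotary embeddings."""
    if not 0.0 <= partial_rotary_factor <= 1.0:
        raise ValueError("partial_rotary_factor must be in [0, 1]")
    return int(head_dim * partial_rotary_factor)


full_attention_rotary_dim = rotary_dim(128, 0.5)     # 64
sliding_attention_rotary_dim = rotary_dim(128, 1.0)  # 128
```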

per_layer_rope_theta

per_layer_rope_theta: list[float]

source

Per-layer RoPE theta values. If empty, uses a single rope_theta.
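The fallback described above can be sketched as a small lookup. The helper rope_theta_for_layer and the theta values are hypothetical and chosen only for illustration.

```python
def rope_theta_for_layer(
    layer_idx: int,
    rope_theta: float,
    per_layer_rope_theta: list[float],
) -> float:
    """Return the RoPE base for one layer, falling back to rope_theta."""
    if not per_layer_rope_theta:
        # Empty list: every layer shares the single rope_theta.
        return rope_theta
    return per_layer_rope_theta[layer_idx]


# Illustrative values only.
shared = rope_theta_for_layer(1, 10000.0, [])
per_layer = rope_theta_for_layer(1, 10000.0, [5000000.0, 10000.0])
```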

share_expert_dim

share_expert_dim: int = 1280

source

Intermediate dimension of the shared expert MLP.

sliding_num_attention_groups

sliding_num_attention_groups: int = 8

source

Number of KV head groups for sliding attention layers.

sliding_num_attention_heads

sliding_num_attention_heads: int = 96

source

Number of attention heads for sliding attention layers.
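Because sliding and full attention layers can have different head counts, a per-layer lookup keyed on layer_types is one way to picture how these fields combine. The sketch below is hypothetical: attention_shape is not a library function, and the full attention head count of 64 is an illustrative value, since num_attention_heads has no documented default.

```python
def attention_shape(
    layer_type: str,
    num_attention_heads: int,
    num_attention_groups: int,
    sliding_num_attention_heads: int = 96,
    sliding_num_attention_groups: int = 8,
) -> tuple[int, int]:
    """Return (query heads, KV head groups) for one layer."""
    if layer_type == "sliding_attention":
        return sliding_num_attention_heads, sliding_num_attention_groups
    if layer_type == "full_attention":
        return num_attention_heads, num_attention_groups
    raise ValueError(f"unknown layer type: {layer_type!r}")


# 64 is a hypothetical num_attention_heads value for full attention layers.
sliding_shape = attention_shape("sliding_attention", 64, 8)
full_shape = attention_shape("full_attention", 64, 8)
```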

sliding_window

sliding_window: int = 512

source

Sliding window size for local attention layers.
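As a rough illustration of local attention, the sketch below builds a causal sliding-window mask in plain Python. It assumes the window counts the current token, so each query sees at most window keys. The exact boundary convention used by the model is not documented here, and sliding_window_mask is a hypothetical helper.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        # Causal (k <= q) and within the last `window` positions.
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Small window for readability; the documented default is 512.
mask = sliding_window_mask(seq_len=5, window=3)
```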

swiglu_limits

swiglu_limits: list[float]

source

Per-layer SwiGLU activation clipping thresholds for routed experts. 0.0 means no clipping. Non-zero values clamp intermediate activations.
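The sketch below shows one way a clipping threshold could be applied to a scalar SwiGLU. Which tensors are clamped, and whether the clamp is one-sided or two-sided, are assumptions: here the gate input is capped from above and the up projection is clamped to [-limit, limit]. The function clipped_swiglu is hypothetical.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clipped_swiglu(gate: float, up: float, limit: float) -> float:
    """Scalar SwiGLU with optional clipping; limit == 0.0 disables it."""
    if limit != 0.0:
        # Assumption: cap the gate from above, clamp up symmetrically.
        gate = min(gate, limit)
        up = max(-limit, min(up, limit))
    return silu(gate) * up


unclipped = clipped_swiglu(10.0, 10.0, limit=0.0)
clipped = clipped_swiglu(10.0, 10.0, limit=7.0)
```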

swiglu_limits_shared

swiglu_limits_shared: list[float]

source

Per-layer SwiGLU activation clipping thresholds for shared experts.

use_head_wise_attn_gate

use_head_wise_attn_gate: bool = True

source

Whether to use per-head sigmoid attention gating (g_proj).
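Per-head sigmoid gating can be pictured as scaling each head's output by a value between 0 and 1. The sketch below assumes one scalar gate logit per head (as might be produced by g_proj) for a single token. That shape, and the helper apply_head_gate, are assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apply_head_gate(
    head_outputs: list[list[float]],
    gate_logits: list[float],
) -> list[list[float]]:
    """Scale each head's output vector by sigmoid(gate logit for that head)."""
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(head_outputs, gate_logits)
    ]


# Two heads with head dimension 2; a logit of 0.0 gives a gate of 0.5.
gated = apply_head_gate([[2.0, 4.0], [1.0, 1.0]], [0.0, 0.0])
```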

yarn_only_types

yarn_only_types: list[str]

source

Layer types that use rope_scaling (e.g. [‘full_attention’]).
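A minimal sketch of how this list could gate RoPE scaling per layer. The helper uses_rope_scaling is hypothetical, and the behavior for an empty list is not documented, so the sketch only covers a simple membership check.

```python
def uses_rope_scaling(layer_type: str, yarn_only_types: list[str]) -> bool:
    """True when the layer's type is listed in yarn_only_types.

    Assumption: plain membership; empty-list semantics are unspecified.
    """
    return layer_type in yarn_only_types


example_types = ["full_attention"]
full_scaled = uses_rope_scaling("full_attention", example_types)
sliding_scaled = uses_rope_scaling("sliding_attention", example_types)
```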

Step3p5Inputs

class max.pipelines.architectures.step3p5.Step3p5Inputs(tokens, input_row_offsets, signal_buffers, return_n_logits, data_parallel_splits=None, host_input_row_offsets=None, ep_inputs=<factory>, *, kv_cache_inputs=None, lora=None, hidden_states=None)

source

Bases: Llama3Inputs

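Inputs for Step-3.5 in TP+EP (tensor-parallel attention with expert-parallel MoE) and DP+EP (data-parallel attention with expert-parallel MoE) modes.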

Extends Llama3Inputs with optional host_input_row_offsets / data_parallel_splits (DP+EP only) and the EP communication buffers (TP+EP and DP+EP).

buffers

property buffers: tuple[Buffer, ...]

source

Returns positional Buffer inputs for model ABI calls.

ep_inputs

ep_inputs: tuple[Buffer, ...]

source

host_input_row_offsets

host_input_row_offsets: Buffer | None = None

source

Step3p5Model

class max.pipelines.architectures.step3p5.Step3p5Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

source

Bases: AlwaysSignalBuffersMixin, LlamaModelBase

Step-3.5-Flash pipeline model.

Supports single-GPU execution, multi-GPU tensor parallelism (TP), TP attention with expert-parallel (EP) MoE, and data-parallel (DP) attention with EP MoE.

attention_bias

attention_bias: bool = False

source

Whether to use attention bias.

load_model()

load_model(session)

source

Parameters:

session (InferenceSession)

Return type:

Model

model

model: Model

source

Compiled and initialized model ready for inference.

model_config_cls

model_config_cls

source

alias of Step3p5Config

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

source

Normalization method to use: 'rms_norm' or 'layer_norm'.

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

source

Prepare the inputs for the first pass in multistep execution.

Return type:

Llama3Inputs | Step3p5Inputs

state_dict

state_dict: dict[str, Any]

source

Weights to load into the model.

Step3p5PretrainedConfig

class max.pipelines.architectures.step3p5.Step3p5PretrainedConfig(**kwargs)

source

Bases: PreTrainedConfig

Custom PretrainedConfig for Step-3.5 so AutoConfig.from_pretrained() works.

This is the primary location for mapping Step-3.5 field names to the standard HuggingFace fields that Llama3Config expects. A subset of these aliases is also applied in Step3p5Config._ensure_hf_config_aliases() as a fallback when trust_remote_code=True loads the repo’s own config class instead of this one.

Parameters:

kwargs (object)
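The field-name mapping described above can be sketched as a small aliasing pass over a raw config dictionary. This is an illustration only: apply_aliases and EXAMPLE_ALIASES are hypothetical, and the single alias shown (num_attention_groups to num_key_value_heads) is inferred from the num_attention_groups description rather than taken from the actual alias table, which is not listed here.

```python
from typing import Any

# Hypothetical alias table: model-specific name -> standard name.
EXAMPLE_ALIASES = {"num_attention_groups": "num_key_value_heads"}


def apply_aliases(raw: dict[str, Any], aliases: dict[str, str]) -> dict[str, Any]:
    """Copy values to their standard names without overwriting existing ones."""
    out = dict(raw)
    for source_name, standard_name in aliases.items():
        if source_name in out and standard_name not in out:
            out[standard_name] = out[source_name]
    return out


config = apply_aliases({"num_attention_groups": 8}, EXAMPLE_ALIASES)
```

Leaving an already-present standard field untouched keeps the pass safe to run twice, which matters if the same aliases are also applied by a fallback path.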

model_type

model_type: str = 'step3p5'

source