Python module
max.pipelines.architectures.gpt_oss
GPT-OSS mixture-of-experts architecture for text generation.
GptOssConfig
class max.pipelines.architectures.gpt_oss.GptOssConfig(*, vocab_size, hidden_size, intermediate_size, num_hidden_layers, num_attention_heads, num_key_value_heads, head_dim, hidden_activation, max_position_embeddings, rms_norm_eps, rope_theta, attention_bias, sliding_window, num_local_experts, num_experts_per_tok, router_aux_loss_coef, layer_types, attention_dropout, rope_scaling, query_pre_attn_scalar, final_logit_softcapping, attn_logit_softcapping, swiglu_limit, dtype, devices, interleaved_rope_weights, kv_params, quant_config=None, tie_word_embeddings=False, return_logits=ReturnLogits.LAST_TOKEN)
Bases: ArchConfigWithKVCache
Configuration for GPT OSS models.
Contains parameters specific to the GPT OSS architecture, typically extracted from a HuggingFace configuration object’s text config.
Parameters:
- vocab_size (int)
- hidden_size (int)
- intermediate_size (int)
- num_hidden_layers (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- head_dim (int)
- hidden_activation (str)
- max_position_embeddings (int)
- rms_norm_eps (float)
- rope_theta (float)
- attention_bias (bool)
- sliding_window (int)
- num_local_experts (int)
- num_experts_per_tok (int)
- router_aux_loss_coef (float)
- layer_types (list[str])
- attention_dropout (float)
- rope_scaling (YarnScalingParams)
- query_pre_attn_scalar (float | None)
- final_logit_softcapping (float | None)
- attn_logit_softcapping (float | None)
- swiglu_limit (float)
- dtype (DType)
- devices (list[DeviceRef])
- interleaved_rope_weights (bool)
- kv_params (KVCacheParams)
- quant_config (QuantConfig | None)
- tie_word_embeddings (bool)
- return_logits (ReturnLogits)
attention_bias
attention_bias: bool
Whether to use a bias in the query, key, value, and output projection layers during self-attention.
attention_dropout
attention_dropout: float
Dropout probability for attention weights.
attn_logit_softcapping
attn_logit_softcapping: float | None
Softcapping value for attention logits.
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the model.
Uses the max_length from the max.pipelines.config.PipelineConfig if provided,
otherwise falls back to the max_position_embeddings from the HuggingFace
configuration’s text config.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – The HuggingFace model configuration object (transformers.AutoConfig).
Returns:
The calculated maximum sequence length.
Return type:
int
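The documented fallback logic (prefer the pipeline's max_length, else the HuggingFace config's max_position_embeddings) can be sketched as follows; the function name and parameters are illustrative, not the MAX API:

```python
def resolve_max_seq_len(pipeline_max_length, hf_max_position_embeddings):
    """Sketch of the documented fallback: use the pipeline config's
    max_length when set, otherwise the HuggingFace config's
    max_position_embeddings."""
    if pipeline_max_length is not None:
        return pipeline_max_length
    return hf_max_position_embeddings
```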
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Constructs the KV cache parameters from configuration objects.
Parameters:
- huggingface_config (AutoConfig) – The HuggingFace model configuration object (transformers.AutoConfig).
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- devices (list[DeviceRef]) – The list of devices the model will run on.
- kv_cache_config (KVCacheConfig) – The MAX Engine KV cache configuration settings (max.pipelines.max_config.KVCacheConfig).
- cache_dtype (DType) – The desired data type for the KV cache (max.dtype.DType).
Returns:
The configured max.pipelines.kv_cache.KVCacheParams object.
Return type:
KVCacheParams
devices
devices: list[DeviceRef]
Devices to run the model with.
dtype
dtype: DType
DType of the model weights and input.
final_logit_softcapping
final_logit_softcapping: float | None
Softcapping value for final logits.
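Logit softcapping smoothly bounds a logit to the open interval (-cap, cap) via cap * tanh(x / cap), which is approximately the identity for small values. A minimal sketch:

```python
import math

def softcap(x: float, cap: float) -> float:
    """Softcapping: squashes a logit into (-cap, cap) while leaving
    small values nearly unchanged."""
    return cap * math.tanh(x / cap)
```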
finalize()
finalize(huggingface_config, state_dict, return_logits)
Define parameters that can’t be determined just from the pipeline config.
Parameters:
- huggingface_config (AutoConfig) – The HuggingFace model configuration object.
- state_dict (dict[str, WeightData]) – The model’s state dictionary containing weights.
- return_logits (ReturnLogits) – Whether to return the last token, all tokens, or a variable number of logits.
Return type:
None
get_kv_params()
get_kv_params()
KV cache parameters to use when running the model.
Return type:
KVCacheParams
get_max_seq_len()
get_max_seq_len()
Returns the default maximum sequence length for the model.
Subclasses should determine whether this value can be overridden by
setting the --max-length (pipeline_config.model.max_length) flag.
Return type:
int
get_num_layers()
static get_num_layers(huggingface_config)
Retrieves the number of hidden layers from the HuggingFace configuration.
Parameters:
huggingface_config (AutoConfig) – The HuggingFace model configuration object (transformers.AutoConfig).
Returns:
The number of hidden layers specified in the configuration.
Return type:
int
head_dim
head_dim: int
The attention head dimension.
hidden_activation
hidden_activation: str
The non-linear activation function (function or string) in the decoder. Will default to “gelu_tanh” if not specified. “gelu_tanh” uses an approximation of the “gelu” activation function.
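The "gelu_tanh" approximation mentioned above is the standard tanh-based approximation of GELU; a minimal sketch (0.044715 is the usual approximation coefficient):

```python
import math

def gelu_tanh(x: float) -> float:
    """tanh approximation of GELU:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))"""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))
```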
hidden_size
hidden_size: int
Dimension of the hidden representations.
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a GptOssConfig instance from pipeline configuration.
This method creates a config instance with all fields that can be determined from the pipeline configuration, without needing the state_dict. Fields that depend on the state_dict (like tie_word_embeddings) should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
An initialized GptOssConfig instance.
Return type:
GptOssConfig
interleaved_rope_weights
interleaved_rope_weights: bool
True if the rope weights are in interleaved complex format.
intermediate_size
intermediate_size: int
Dimension of the MLP representations.
kv_params
kv_params: KVCacheParams
KV cache parameters.
layer_types
layer_types: list[str]
Type of attention for each layer (‘full_attention’ or ‘sliding_attention’).
max_position_embeddings
max_position_embeddings: int
The maximum sequence length that this model might ever be used with.
num_attention_heads
num_attention_heads: int
Number of attention heads for each attention layer in the Transformer decoder.
num_experts_per_tok
num_experts_per_tok: int
Number of experts selected per token in MoE layers.
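Top-k expert selection of this kind is typically implemented as a softmax over the router's logits followed by picking the highest-scoring experts. A minimal sketch of that pattern (illustrative, not this module's router):

```python
import math

def route_token(router_logits: list[float], num_experts_per_tok: int):
    """Pick the top-k experts for one token and return (expert_index,
    normalized_weight) pairs, highest weight first."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:num_experts_per_tok]
    m = max(router_logits[i] for i in topk)  # subtract max for numerical stability
    exps = {i: math.exp(router_logits[i] - m) for i in topk}
    z = sum(exps.values())
    return [(i, exps[i] / z) for i in topk]
```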
num_hidden_layers
num_hidden_layers: int
Number of hidden layers in the Transformer decoder.
num_key_value_heads
num_key_value_heads: int
Number of key_value heads that should be used to implement Grouped Query Attention.
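In grouped-query attention, consecutive groups of query heads share one key/value head; the mapping can be sketched as:

```python
def kv_head_for_query_head(q_head: int, num_attention_heads: int,
                           num_key_value_heads: int) -> int:
    """Which KV head a given query head reads from under grouped-query
    attention. Illustrative sketch, not the MAX implementation."""
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size
```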
num_local_experts
num_local_experts: int
Number of experts in each MoE layer.
quant_config
quant_config: QuantConfig | None = None
Float8/Float4 quantization configuration, if applicable.
query_pre_attn_scalar
query_pre_attn_scalar: float | None
Scalar applied to queries before attention computation.
return_logits
return_logits: ReturnLogits = 'last_token'
Whether to return the last token, all logits, or a variable number of logits.
rms_norm_eps
rms_norm_eps: float
The epsilon used by the rms normalization layers.
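The normalization this epsilon guards divides by the vector's root mean square, with eps preventing division by zero; a minimal sketch:

```python
import math

def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMS normalization: scale each element by 1/rms(x), then by a learned
    per-channel weight. eps keeps the denominator nonzero."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```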
rope_scaling
rope_scaling: YarnScalingParams
Scaling configuration for the RoPE embeddings used in global attention.
rope_theta
rope_theta: float
The base period of the RoPE embeddings.
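RoPE derives a rotation frequency for each pair of head dimensions from this base period, following the standard formulation theta^(-2i/d); a sketch:

```python
def rope_inv_freqs(head_dim: int, rope_theta: float) -> list[float]:
    """Per-pair RoPE rotation frequencies: theta^(-2i/d) for each of the
    head_dim // 2 dimension pairs. Illustrative sketch."""
    return [rope_theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```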
router_aux_loss_coef
router_aux_loss_coef: float
Coefficient for the auxiliary load balancing loss in MoE layers.
sliding_window
sliding_window: int
In the GPT OSS language model, specific layers use sliding window attention. This is the size of the sliding window.
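The resulting mask rule can be sketched as follows, using the layer_types naming from this config; this is an illustration of the masking semantics, not the MAX attention kernel:

```python
def can_attend(q_pos: int, k_pos: int, layer_type: str, sliding_window: int) -> bool:
    """Causal mask: 'sliding_attention' layers only see the previous
    `sliding_window` positions; 'full_attention' layers see all prior tokens."""
    if k_pos > q_pos:
        return False  # causal: never attend to future tokens
    if layer_type == "sliding_attention":
        return q_pos - k_pos < sliding_window
    return True
```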
swiglu_limit
swiglu_limit: float
Clamping limit for SwiGLU activation in MoE layers.
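One plausible form of such clamping bounds both MLP branches before the SiLU-gated product; the exact scheme below is an assumption for illustration, not taken from this module:

```python
import math

def clamped_swiglu(gate: float, up: float, limit: float) -> float:
    """Sketch of SwiGLU with a clamping limit: inputs are clamped to
    [-limit, limit] (an assumed scheme) before SiLU(gate) * up."""
    gate = max(-limit, min(gate, limit))
    up = max(-limit, min(up, limit))
    silu = gate / (1.0 + math.exp(-gate))  # SiLU(gate) = gate * sigmoid(gate)
    return silu * up
```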
tie_word_embeddings
tie_word_embeddings: bool = False
Whether to tie weight embeddings. When true, the output linear layer uses the same weight as the embedding layer.
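With tied embeddings, the output projection is just a matrix product against the embedding table, so logits are dot products of the hidden state with each token's embedding row; a minimal sketch:

```python
def output_logits(hidden: list[float], embedding_matrix: list[list[float]]) -> list[float]:
    """Tied output projection: one logit per vocabulary entry, computed
    as hidden · embedding_row. Illustrative sketch."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding_matrix]
```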
vocab_size
vocab_size: int
Vocabulary size of the GPT OSS model.
GptOssModel
class max.pipelines.architectures.gpt_oss.GptOssModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextContext]
A GPT OSS pipeline model for text generation.
This class integrates the GPT OSS architecture with the MAX Engine pipeline infrastructure, handling model loading, KV cache management, and input preparation for inference.
Parameters:
- pipeline_config (PipelineConfig) – The configuration settings for the entire pipeline.
- session (InferenceSession) – The MAX Engine inference session managing the runtime.
- devices (list[Device]) – A list of MAX Engine devices (max.driver.Device) to run the model on.
- kv_cache_config (KVCacheConfig) – Configuration settings for the Key-Value cache (max.pipelines.max_config.KVCacheConfig).
- weights (Weights) – The model weights (max.graph.weights.Weights).
- adapter (WeightsAdapter | None) – An optional adapter to modify weights before loading (max.graph.weights.WeightsAdapter).
- return_logits (ReturnLogits) – The number of top logits to return from the model execution.
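The prepare/execute cycle exposed by the methods below can be illustrated with a toy stand-in; ToyModel is hypothetical (the real model returns logits rather than a fixed token) and only mirrors the documented method names:

```python
class ToyModel:
    """Hypothetical stand-in that mimics the pipeline-model call sequence:
    prepare_initial_token_inputs -> execute -> prepare_next_token_inputs."""

    def prepare_initial_token_inputs(self, replica_batches, kv_cache_inputs=None,
                                     return_n_logits=1):
        # One replica, one context: take its prompt tokens.
        return {"tokens": list(replica_batches[0][0])}

    def execute(self, model_inputs):
        # Pretend decoding always selects token 0.
        return {"next_token": 0}

    def prepare_next_token_inputs(self, next_tokens, prev_model_inputs):
        return {"tokens": prev_model_inputs["tokens"] + [next_tokens]}

model = ToyModel()
inputs = model.prepare_initial_token_inputs([[[5, 6, 7]]])
for _ in range(3):  # three decode steps
    out = model.execute(inputs)
    inputs = model.prepare_next_token_inputs(out["next_token"], inputs)
print(inputs["tokens"])  # [5, 6, 7, 0, 0, 0]
```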
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the GPT OSS model.
Uses the max_length from the max.pipelines.config.PipelineConfig
if provided, otherwise falls back to the max_position_embeddings from
the HuggingFace configuration’s text config.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – The HuggingFace model configuration object (transformers.AutoConfig).
Returns:
The calculated maximum sequence length.
Return type:
int
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.
Parameters:
- pipeline_config (PipelineConfig) – Pipeline configuration.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
Returns:
Estimated activation memory in bytes.
Return type:
int
execute()
execute(model_inputs)
Executes the GPT OSS model with the prepared inputs.
Parameters:
model_inputs (ModelInputs) – The prepared inputs for the model execution, typically including token IDs, attention masks/offsets, and KV cache inputs.
Returns:
An object containing the output logits from the model execution.
Return type:
ModelOutputs
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Gets the parameters required to configure the KV cache for GPT OSS.
Delegates to the GptOssConfig.construct_kv_params static method.
Parameters:
- huggingface_config (AutoConfig) – The HuggingFace model configuration object (transformers.AutoConfig).
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- devices (list[DeviceRef]) – The list of devices the model will run on.
- kv_cache_config (KVCacheConfig) – The MAX Engine KV cache configuration settings (max.pipelines.max_config.KVCacheConfig).
- cache_dtype (DType) – The desired data type for the KV cache (max.dtype.DType).
Returns:
The configured max.pipelines.kv_cache.KVCacheParams object.
Return type:
KVCacheParams
load_model()
load_model(session)
Loads the compiled GPT OSS model into the MAX Engine session.
Parameters:
session (InferenceSession) – The MAX Engine inference session.
Returns:
The loaded MAX Engine model object.
Return type:
Model
model
model: Model
The compiled and initialized MAX Engine model ready for inference.
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the GPT OSS model.
Parameters:
- replica_batches (Sequence[Sequence[TextContext]]) – Per-replica batches of TextContext objects representing the input prompts.
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None) – Optional inputs required by the KV cache manager.
- return_n_logits (int)
Returns:
The prepared ModelInputs object for the initial execution step.
Return type:
ModelInputs
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the inputs for subsequent execution steps in a multi-step generation.
Parameters:
- next_tokens (Buffer) – The tensor containing the token IDs generated in the previous step.
- prev_model_inputs (ModelInputs) – The ModelInputs used in the previous execution step.
Returns:
The prepared ModelInputs object for the next execution step.
Return type:
ModelInputs