For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).
Python module
max.pipelines.architectures.gpt_oss_modulev3
GPT-OSS mixture-of-experts architecture for text generation.
GptOssConfig
class max.pipelines.architectures.gpt_oss_modulev3.GptOssConfig(*, vocab_size, hidden_size, intermediate_size, num_hidden_layers, num_attention_heads, num_key_value_heads, head_dim, hidden_activation, max_position_embeddings, rms_norm_eps, rope_theta, attention_bias, sliding_window, num_local_experts, num_experts_per_tok, router_aux_loss_coef, layer_types, attention_dropout, rope_scaling, query_pre_attn_scalar, final_logit_softcapping, attn_logit_softcapping, swiglu_limit, dtype, devices, interleaved_rope_weights, kv_params, tie_word_embeddings=False, return_logits=ReturnLogits.LAST_TOKEN)
Bases: ArchConfigWithPermissiveMaxSeqLen, ArchConfigWithStoredKVParams, ArchConfigWithKVCache
Configuration for GPT OSS models.
Contains parameters specific to the GPT OSS architecture, typically extracted from a HuggingFace configuration object's text config.
Parameters:
- vocab_size (int)
- hidden_size (int)
- intermediate_size (int)
- num_hidden_layers (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- head_dim (int)
- hidden_activation (str)
- max_position_embeddings (int)
- rms_norm_eps (float)
- rope_theta (float)
- attention_bias (bool)
- sliding_window (int)
- num_local_experts (int)
- num_experts_per_tok (int)
- router_aux_loss_coef (float)
- layer_types (list[str])
- attention_dropout (float)
- rope_scaling (YarnScalingParams)
- query_pre_attn_scalar (float | None)
- final_logit_softcapping (float | None)
- attn_logit_softcapping (float | None)
- swiglu_limit (float)
- dtype (DType)
- devices (list[DeviceRef])
- interleaved_rope_weights (bool)
- kv_params (KVCacheParams)
- tie_word_embeddings (bool)
- return_logits (ReturnLogits)
attention_bias
attention_bias: bool
Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout
attention_dropout: float
Dropout probability for attention weights.
attn_logit_softcapping
Softcapping value for attention logits.
devices
Devices to run the model with.
dtype
dtype: DType
DType of the model weights and input.
final_logit_softcapping
Softcapping value for final logits.
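The source does not state the softcapping formula. As a minimal sketch, assuming the tanh-based form used by several other architectures (cap * tanh(x / cap)), softcapping squashes logits smoothly into the open interval (-cap, cap). The helper name softcap is hypothetical and is not part of this module.

```python
import math


def softcap(logits, cap):
    """Apply tanh-based softcapping (assumed formula, not confirmed by the source).

    A cap of None is treated as "softcapping disabled", mirroring the
    float | None type of the softcapping fields.
    """
    if cap is None:
        return list(logits)
    # Values near zero pass through almost unchanged; large values saturate at +/- cap.
    return [cap * math.tanh(x / cap) for x in logits]


capped = softcap([-100.0, -1.0, 0.0, 1.0, 100.0], cap=30.0)
```

Under this assumption, small logits are nearly untouched while extreme logits are bounded by the cap.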
finalize()
finalize(huggingface_config, state_dict, return_logits)
Define parameters that can't be determined just from the pipeline config.
Parameters:
- huggingface_config (AutoConfig): The HuggingFace model configuration object.
- state_dict (dict[str, WeightData]): The model's state dictionary containing weights.
- return_logits (ReturnLogits): Whether to return the last token, all tokens or a variable number of logits.
Return type:
None
get_num_layers()
static get_num_layers(huggingface_config)
Retrieves the number of hidden layers from the HuggingFace configuration.
Parameters:
huggingface_config (AutoConfig): The HuggingFace model configuration object (transformers.AutoConfig).
Returns:
The number of hidden layers specified in the configuration.
Return type:
head_dim
head_dim: int
The attention head dimension.
hidden_activation
hidden_activation: str
The non-linear activation function (function or string) in the decoder. Defaults to "gelu_tanh" if not specified. "gelu_tanh" uses an approximation of the "gelu" activation function.
hidden_size
hidden_size: int
Dimension of the hidden representations.
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a GptOssConfig instance from pipeline configuration.
This method creates a config instance with all fields that can be determined from the pipeline configuration, without needing the state_dict. Fields that depend on the state_dict (like tie_word_embeddings) should be set via the finalize() method.
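The two-phase pattern described above can be sketched with a stand-in class. Everything in this block is hypothetical: SketchConfig, the dictionary-based pipeline settings, and the rule that a missing "lm_head.weight" entry implies tied embeddings are assumptions made for illustration, not the behavior of GptOssConfig.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in that mimics an initialize-then-finalize flow."""

    hidden_size: int
    tie_word_embeddings: bool = False

    @classmethod
    def initialize(cls, pipeline_settings: dict) -> "SketchConfig":
        # Phase 1: set only the fields that are known before weights are read.
        return cls(hidden_size=pipeline_settings["hidden_size"])

    def finalize(self, state_dict: dict) -> None:
        # Phase 2: fill in fields that depend on the weights.
        # Assumption: no separate output projection means the embeddings are tied.
        self.tie_word_embeddings = "lm_head.weight" not in state_dict


config = SketchConfig.initialize({"hidden_size": 8})
config.finalize({"embed_tokens.weight": [0.0] * 8})
```

Splitting construction this way lets the weight-independent fields be used before any weights have been loaded.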
Parameters:
- pipeline_config (PipelineConfig): The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
An initialized GptOssConfig instance.
Return type:
interleaved_rope_weights
interleaved_rope_weights: bool
True if the rope weights are in interleaved complex format.
intermediate_size
intermediate_size: int
Dimension of the MLP representations.
kv_params
kv_params: KVCacheParams
KV cache parameters.
layer_types
Type of attention for each layer ("full_attention" or "sliding_attention").
max_position_embeddings
max_position_embeddings: int
The maximum sequence length that this model might ever be used with.
num_attention_heads
num_attention_heads: int
Number of attention heads for each attention layer in the Transformer decoder.
num_experts_per_tok
num_experts_per_tok: int
Number of experts selected per token in MoE layers.
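To illustrate how num_experts_per_tok relates to num_local_experts, here is a minimal sketch of top-k expert routing for one token. It assumes the router emits one logit per local expert and that weights are a softmax over the selected experts only; the function name route_token and that normalization choice are assumptions, not something this page specifies.

```python
import math


def route_token(router_logits, num_experts_per_tok):
    """Pick the top-k experts for one token and normalize their weights.

    len(router_logits) plays the role of num_local_experts.
    """
    ranked = sorted(
        range(len(router_logits)),
        key=lambda expert: router_logits[expert],
        reverse=True,
    )
    selected = ranked[:num_experts_per_tok]
    # Softmax over the selected logits only (assumed normalization).
    peak = max(router_logits[e] for e in selected)
    exps = [math.exp(router_logits[e] - peak) for e in selected]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(selected, exps)]


# Four local experts, two selected per token.
assignments = route_token([0.1, 2.0, -1.0, 1.5], num_experts_per_tok=2)
```

Each token is then processed only by its selected experts, and the expert outputs are combined using the returned weights.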
num_hidden_layers
num_hidden_layers: int
Number of hidden layers in the Transformer decoder.
num_key_value_heads
num_key_value_heads: int
Number of key-value heads used to implement Grouped Query Attention.
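As a sketch of what Grouped Query Attention implies for these two fields, the helper below maps a query head to the key-value head it shares. It assumes the common convention that consecutive query heads form a group; the function name and the example head counts are illustrative only.

```python
def kv_head_for_query_head(query_head, num_attention_heads, num_key_value_heads):
    """Return the key-value head index shared by a given query head."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be a multiple of num_key_value_heads")
    if not 0 <= query_head < num_attention_heads:
        raise ValueError("query_head is out of range")
    # Each key-value head serves this many consecutive query heads.
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size


# Illustrative values: 64 query heads sharing 8 key-value heads.
mapping = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
```

With fewer key-value heads than query heads, the KV cache shrinks by the group size while the number of query heads stays the same.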
num_local_experts
num_local_experts: int
Number of experts in each MoE layer.
query_pre_attn_scalar
Scalar applied to queries before attention computation.
return_logits
return_logits: ReturnLogits = 'last_token'
Whether to return the last token, all logits, or a variable number of logits.
rms_norm_eps
rms_norm_eps: float
The epsilon used by the rms normalization layers.
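For reference, a minimal sketch of where the epsilon enters, assuming the standard RMS normalization formula (divide by the root mean square, then apply a learned per-element weight). The function name is hypothetical and this is not the module's implementation.

```python
import math


def rms_norm(values, weight, eps):
    """Standard RMS normalization over a single vector (assumed formulation)."""
    mean_square = sum(v * v for v in values) / len(values)
    # eps keeps the denominator away from zero for all-zero inputs.
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(values, weight)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=1e-6)
```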
rope_scaling
rope_scaling: YarnScalingParams
Scaling configuration for the RoPE embeddings used in global attention.
rope_theta
rope_theta: float
The base period of the RoPE embeddings.
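A minimal sketch of how a base period is typically used, assuming the standard RoPE formulation in which each pair of head dimensions rotates at frequency rope_theta ** (-2i / head_dim). The scaling described by rope_scaling is deliberately left out, and the function name is hypothetical.

```python
def rope_inverse_frequencies(head_dim, rope_theta):
    """Per-pair rotation frequencies for unscaled RoPE (assumed formulation)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even so dimensions can be paired")
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


# Illustrative values only.
frequencies = rope_inverse_frequencies(head_dim=8, rope_theta=10000.0)
```

A larger rope_theta lowers the slowest frequencies, which stretches the range of positions that can be told apart.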
router_aux_loss_coef
router_aux_loss_coef: float
Coefficient for the auxiliary load balancing loss in MoE layers.
sliding_window
sliding_window: int
In the GPT OSS language model, specific layers use sliding window attention. This is the size of the sliding window.
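The difference between the two layer_types values can be sketched as a causal visibility mask. This assumes a sliding window of size w lets a token see itself and the w - 1 tokens before it; the exact window convention is an assumption, and attention_mask is a hypothetical helper.

```python
def attention_mask(seq_len, layer_type, sliding_window):
    """Build a boolean mask where mask[i][j] says whether token i may attend to token j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look ahead
            if layer_type == "sliding_attention":
                # Assumed convention: the window counts the current token.
                visible = visible and (i - j) < sliding_window
            row.append(visible)
        mask.append(row)
    return mask


full = attention_mask(4, "full_attention", sliding_window=2)
sliding = attention_mask(4, "sliding_attention", sliding_window=2)
```

In the sliding case the last token sees only the two most recent positions, while in the full case it sees the whole prefix.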
swiglu_limit
swiglu_limit: float
Clamping limit for SwiGLU activation in MoE layers.
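This page does not give the clamped activation formula. The sketch below assumes the scheme from the openly published GPT-OSS reference code: the gate is clamped from above, the linear branch is clamped on both sides, and the result is (up + 1) * gate * sigmoid(alpha * gate). The alpha default of 1.702 and the function names are assumptions for illustration.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def clamped_swiglu(gate, up, limit, alpha=1.702):
    """Scalar SwiGLU with clamping (assumed formulation, not confirmed by the source)."""
    gate = min(gate, limit)             # clamp the gate from above only
    up = max(-limit, min(up, limit))    # clamp the linear branch on both sides
    glu = gate * _sigmoid(alpha * gate)
    return (up + 1.0) * glu


bounded = clamped_swiglu(gate=100.0, up=100.0, limit=7.0)
```

Under these assumptions the activation cannot exceed (limit + 1) * limit, regardless of how large the pre-activation values are.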
tie_word_embeddings
tie_word_embeddings: bool = False
Whether to tie weight embeddings. When true, the output linear layer uses the same weight as the embedding layer.
vocab_size
vocab_size: int
Vocabulary size of the GPT OSS model.
GptOssInputs
class max.pipelines.architectures.gpt_oss_modulev3.GptOssInputs(tokens, input_row_offsets, return_n_logits, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)
Bases: ModelInputs
A class representing inputs for the GPT OSS model.
This class encapsulates the input tensors required for the GPT OSS model execution.
Parameters:
input_row_offsets
input_row_offsets: Buffer
Buffer containing the offsets for each row in the ragged input sequence.
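A ragged batch stores sequences of different lengths back to back in one flat token array. As a minimal sketch, assuming the offsets are cumulative token counts with a leading zero (so there is one more offset than there are sequences), the layout can be built and read back as follows. Plain Python lists stand in for Buffer objects, and the helper names are hypothetical.

```python
from itertools import accumulate


def build_ragged_batch(sequences):
    """Flatten variable-length sequences and compute their row offsets."""
    tokens = [token for sequence in sequences for token in sequence]
    # Offsets mark where each row starts; the final entry is the total token count.
    offsets = [0, *accumulate(len(sequence) for sequence in sequences)]
    return tokens, offsets


def row(tokens, offsets, index):
    """Recover one sequence from the flat layout."""
    return tokens[offsets[index]:offsets[index + 1]]


tokens, input_row_offsets = build_ragged_batch([[11, 12, 13], [21], [31, 32]])
```

This layout avoids padding: the batch above holds six tokens in total instead of three rows padded to length three.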
return_n_logits
return_n_logits: Buffer
Number of logits to return.
tokens
tokens: Buffer
Buffer containing the input token IDs.
GptOssModel
class max.pipelines.architectures.gpt_oss_modulev3.GptOssModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: PipelineModelWithKVCache[TextContext]
A GPT OSS pipeline model for text generation.
This class integrates the GPT OSS architecture with the MAX Engine pipeline infrastructure, handling model loading, KV cache management, and input preparation for inference.
Parameters:
- pipeline_config (PipelineConfig): The configuration settings for the entire pipeline.
- session (InferenceSession): The MAX Engine inference session managing the runtime.
- devices (list[Device]): A list of MAX Engine devices (max.driver.Device) to run the model on.
- kv_cache_config (KVCacheConfig): Configuration settings for the Key-Value cache (max.pipelines.max_config.KVCacheConfig).
- weights (Weights): The model weights (max.graph.weights.Weights).
- adapter (WeightsAdapter | None): An optional adapter to modify weights before loading (max.graph.weights.WeightsAdapter).
- return_logits (ReturnLogits): Which logits to return from the model execution: the last token, all tokens, or a variable number.
execute()
execute(model_inputs)
Executes the GPT OSS model with the prepared inputs.
Parameters:
model_inputs (ModelInputs): The prepared inputs for the model execution, typically including token IDs, attention masks/offsets, and KV cache inputs.
Returns:
An object containing the output logits from the model execution.
Return type:
load_model()
load_model()
Loads the compiled GPT OSS model into the MAX Engine session.
model_config_cls
model_config_cls
alias of GptOssConfig
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the GPT OSS model.
Parameters:
- replica_batches (Sequence[Sequence[TextContext]]): A sequence of sequences of TextContext objects representing the input prompts for each replica.
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None): Optional inputs required by the KV cache manager.
- return_n_logits (int)
Returns:
The prepared ModelInputs object for the initial execution step.
Return type: