
Python module

max.pipelines.architectures.olmo2

OLMo 2 transformer architecture for text generation.

Olmo2Config

class max.pipelines.architectures.olmo2.Olmo2Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, use_subgraphs=True, data_parallel_degree=1)

Bases: Llama3Config

Implementation of MAXModelConfig for Olmo2 models. Olmo2 models differ from Llama3 in how head_dim is determined: Llama3 calculates head_dim as hidden_size // num_attention_heads, whereas Olmo2 models have an explicit head_dim field in their configuration.
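
The two head_dim conventions can be sketched as follows. This is an illustration only, not the MAX implementation: SimpleNamespace stands in for a HuggingFace config object, the helper names derived_head_dim and explicit_head_dim are hypothetical, and the values are chosen so that the two conventions disagree.

```python
from types import SimpleNamespace


def derived_head_dim(config) -> int:
    """Llama3-style: derive head_dim from hidden_size and the head count."""
    return config.hidden_size // config.num_attention_heads


def explicit_head_dim(config) -> int:
    """Olmo2-style: read the head_dim field stored in the config."""
    return config.head_dim


# Hypothetical values; a real checkpoint may well have matching numbers.
config = SimpleNamespace(hidden_size=4096, num_attention_heads=32, head_dim=64)

print(derived_head_dim(config))   # 128
print(explicit_head_dim(config))  # 64
```

When the two values differ, any code path that derives head_dim would size attention and the KV cache incorrectly, which is why the Olmo2 config overrides the Llama3 behavior.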

Parameters:

calculate_attention_multiplier()

static calculate_attention_multiplier(huggingface_config)

Computes the attention multiplier for Olmo2 models, using the explicit head_dim from the config instead of calculating it.

Returns:

The attention multiplier value.

Parameters:

huggingface_config (AutoConfig) – The HuggingFace configuration object.

Return type:

float
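
A minimal sketch of how such a multiplier could be computed. The page does not state the formula, so this assumes the standard scaled dot-product attention factor 1/sqrt(head_dim); both that formula and the helper name attention_multiplier_sketch are assumptions, not the library's API.

```python
import math
from types import SimpleNamespace


def attention_multiplier_sketch(config) -> float:
    # Assumption: the multiplier is the usual 1/sqrt(head_dim) scaling,
    # taken from the explicit head_dim field rather than from
    # hidden_size // num_attention_heads.
    return 1.0 / math.sqrt(config.head_dim)


# Hypothetical config values for illustration.
config = SimpleNamespace(hidden_size=4096, num_attention_heads=32, head_dim=64)
print(attention_multiplier_sketch(config))  # 0.125
```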

construct_kv_params()

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Overrides the default Llama3Config.construct_kv_params to use head_dim from the config. Olmo2 models have an explicit head_dim field in their configuration, unlike Llama models, where it must be calculated.

Returns:

KVCacheParams object with the correct head_dim from config.

Parameters:

  • huggingface_config – The HuggingFace configuration object.
  • pipeline_config – The MAX Engine pipeline configuration.
  • devices – Devices to use for the KV cache.
  • kv_cache_config – Configuration for KV cache.
  • cache_dtype – Data type for the cache.

Return type:

KVCacheParams
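
To see why the cache needs the correct head_dim, here is a rough size estimate. It assumes a conventional layout in which each layer stores one key vector and one value vector of head_dim elements per KV head per token. kv_bytes_per_token is a hypothetical helper and says nothing about how KVCacheParams is actually structured.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_element: int
) -> int:
    # Factor of 2: one key and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical model shape with a 2-byte cache dtype (e.g. bfloat16).
print(kv_bytes_per_token(32, 8, 128, 2))  # 131072
print(kv_bytes_per_token(32, 8, 64, 2))   # 65536
```

Under these assumptions the per-token footprint scales linearly with head_dim, so a wrongly derived head_dim would misallocate the cache by the same ratio.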

finalize()

finalize(huggingface_config, state_dict, return_logits, return_hidden_states=ReturnHiddenStates.NONE, norm_method='rms_norm', attention_bias=False)

Define parameters that can’t be determined just from the pipeline config.

Delegates to the parent Llama3Config.finalize() method.

Parameters:

  • huggingface_config (AutoConfig) – The HuggingFace model configuration object.
  • state_dict (dict[str, WeightData]) – The model’s state dictionary containing weights.
  • return_logits (ReturnLogits) – Which logits to return: those for the last token, for all tokens, or for a variable number of tokens.
  • return_hidden_states (ReturnHiddenStates) – Whether to return hidden states.
  • norm_method (Literal['rms_norm', 'layer_norm']) – The normalization method to use.
  • attention_bias (bool) – Whether to include bias in attention projections.

Return type:

None

get_head_dim()

static get_head_dim(huggingface_config)

Parameters:

huggingface_config (AutoConfig)

Return type:

int

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self
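
The model_config default described above can be sketched as a simple fallback. This is an illustration with stand-in objects: resolve_model_config is a hypothetical helper, and SimpleNamespace replaces the real PipelineConfig.

```python
from types import SimpleNamespace


def resolve_model_config(pipeline_config, model_config=None):
    # Hypothetical helper: fall back to the pipeline's main model config
    # unless the caller passes an explicit one.
    return pipeline_config.model if model_config is None else model_config


# Stand-in pipeline config; strings replace real model config objects.
pipeline_config = SimpleNamespace(
    model="main-model-config", draft_model="draft-model-config"
)

print(resolve_model_config(pipeline_config))  # main-model-config
print(resolve_model_config(pipeline_config, pipeline_config.draft_model))  # draft-model-config
```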

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

Initializes an Olmo2Config instance from pipeline and HuggingFace configuration.

This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configuration, without needing the state_dict. Fields that depend on the state_dict (like tie_word_embeddings, quant_config) should be set via the finalize() method.

Overrides Llama3Config.initialize_from_config to use Olmo2-specific KV params and attention multiplier calculations.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – The HuggingFace model configuration object.
  • model_config (MAXModelConfig | None) – The MAX Engine model configuration.

Returns:

An initialized Olmo2Config instance.

Return type:

Self
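
The two-phase pattern described above (build from configuration first, then fill in weight-dependent fields via finalize()) can be sketched with a toy class. SketchConfig is hypothetical and is not the MAX class, and the rule used to infer tie_word_embeddings is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for an architecture config; not the MAX class."""

    hidden_size: int
    head_dim: int
    # Unknown until the weights are inspected.
    tie_word_embeddings: Optional[bool] = None

    @classmethod
    def initialize_from_config(cls, hf_config: dict[str, Any]) -> "SketchConfig":
        # Phase 1: fill everything the configuration alone determines.
        return cls(
            hidden_size=hf_config["hidden_size"],
            head_dim=hf_config["head_dim"],
        )

    def finalize(self, state_dict: dict[str, Any]) -> None:
        # Phase 2 (illustrative assumption): treat embeddings as tied when
        # the checkpoint has no separate output projection weight.
        self.tie_word_embeddings = "lm_head.weight" not in state_dict


config = SketchConfig.initialize_from_config({"hidden_size": 4096, "head_dim": 128})
config.finalize({"model.embed_tokens.weight": object()})
print(config.tie_word_embeddings)  # True
```

Splitting construction this way lets the configuration be inspected before any weights are loaded, at the cost of some fields being unset until finalize() runs.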

Olmo2Model

class max.pipelines.architectures.olmo2.Olmo2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: LlamaModelBase

OLMo 2 pipeline model implementation.

Parameters:

attention_bias

attention_bias: bool = False

Whether to use attention bias.

model

model: Model

Compiled and initialized model ready for inference.

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method to use, either 'rms_norm' or 'layer_norm'.

state_dict

state_dict: dict[str, Any]

Weights to load into the model.