Python module
max.pipelines.architectures.olmo2
OLMo 2 transformer architecture for text generation.
Olmo2Config
class max.pipelines.architectures.olmo2.Olmo2Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, use_subgraphs=True, data_parallel_degree=1)
Bases: Llama3Config
Implementation of MAXModelConfig for Olmo2 models. Olmo2 models handle head_dim differently from Llama3: whereas Llama3 derives head_dim as hidden_size // num_attention_heads, Olmo2 models carry an explicit head_dim field in their configuration.
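The distinction can be sketched with small stand-ins for HuggingFace configs. The resolve_head_dim helper and the SimpleNamespace configs below are illustrative only, not part of the MAX API:

```python
from types import SimpleNamespace

def resolve_head_dim(hf_config):
    # Olmo2-style configs carry an explicit head_dim field; Llama3-style
    # configs derive it as hidden_size // num_attention_heads.
    head_dim = getattr(hf_config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return hf_config.hidden_size // hf_config.num_attention_heads

# Llama3-style config: head_dim is derived from the quotient.
llama_cfg = SimpleNamespace(hidden_size=4096, num_attention_heads=32)

# Olmo2-style config: head_dim is explicit and may differ from the quotient.
olmo_cfg = SimpleNamespace(hidden_size=4096, num_attention_heads=32, head_dim=64)

print(resolve_head_dim(llama_cfg))  # 128
print(resolve_head_dim(olmo_cfg))   # 64
```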
Parameters:
- hidden_size (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- num_hidden_layers (int)
- rope_theta (float)
- rope_scaling_params (Llama3RopeScalingParams | None)
- max_seq_len (int)
- intermediate_size (int)
- interleaved_rope_weights (bool)
- vocab_size (int)
- dtype (DType)
- model_quantization_encoding (QuantizationEncoding | None)
- quantization_config (QuantizationConfig | None)
- kv_params (KVCacheParams)
- return_logits (ReturnLogits)
- norm_method (Literal['rms_norm', 'layer_norm'])
- norm_dtype (DType | None)
- attention_bias (bool)
- rms_norm_eps (float | None)
- tie_word_embeddings (bool)
- stacked_mlp (bool)
- stacked_qkv (bool)
- attention_multiplier (float)
- embedding_multiplier (float)
- residual_multiplier (float)
- devices (list[DeviceRef])
- clip_qkv (float | None)
- quant_config (QuantConfig | None)
- lora_config (LoRAConfig | None)
- longrope_scaling_params (LongRoPEScalingParams | None)
- logits_scaling (float)
- return_hidden_states (ReturnHiddenStates)
- use_subgraphs (bool)
- data_parallel_degree (int)
calculate_attention_multiplier()
static calculate_attention_multiplier(huggingface_config)
The attention multiplier for Olmo2 models. Uses the explicit head_dim from the config instead of calculating it.
Parameters:
huggingface_config (AutoConfig) – The HuggingFace configuration object.
Returns:
The attention multiplier value.
Return type:
float
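As a sketch, assuming the standard scaled-dot-product scaling of 1/sqrt(head_dim) (the function below is a simplified re-implementation on a stand-in config object, not the MAX source):

```python
import math
from types import SimpleNamespace

def calculate_attention_multiplier(hf_config):
    # Standard attention scaling is 1/sqrt(head_dim). The Olmo2 variant
    # reads head_dim directly from the config rather than deriving it
    # from hidden_size // num_attention_heads.
    return 1.0 / math.sqrt(hf_config.head_dim)

cfg = SimpleNamespace(head_dim=64)  # illustrative stand-in for AutoConfig
print(calculate_attention_multiplier(cfg))  # 0.125
```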
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Override the default Llama3Config.construct_kv_params to use head_dim from the config. Olmo2 models have an explicit head_dim field in their configuration, unlike Llama models, where it must be calculated.
Parameters:
- huggingface_config (AutoConfig) – The HuggingFace configuration object.
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- devices (list[DeviceRef]) – Devices to use for the KV cache.
- kv_cache_config (KVCacheConfig) – Configuration for the KV cache.
- cache_dtype (DType) – Data type for the cache.
Returns:
KVCacheParams object with the correct head_dim from the config.
Return type:
KVCacheParams
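The essential override can be sketched with a stand-in dataclass (KVParams below is a hypothetical simplification of KVCacheParams, and the config is a SimpleNamespace rather than a real AutoConfig):

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class KVParams:
    # Hypothetical stand-in for MAX's KVCacheParams.
    n_kv_heads: int
    head_dim: int

def construct_kv_params(hf_config):
    # Olmo2: take head_dim straight from the config, instead of
    # hidden_size // num_attention_heads as Llama3 would compute it.
    return KVParams(
        n_kv_heads=hf_config.num_key_value_heads,
        head_dim=hf_config.head_dim,
    )

cfg = SimpleNamespace(
    hidden_size=4096, num_attention_heads=32,
    num_key_value_heads=8, head_dim=64,
)
params = construct_kv_params(cfg)
print(params.head_dim, params.n_kv_heads)  # 64 8
```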
finalize()
finalize(huggingface_config, state_dict, return_logits, return_hidden_states=ReturnHiddenStates.NONE, norm_method='rms_norm', attention_bias=False)
Define parameters that can’t be determined just from the pipeline config.
Delegates to the parent Llama3Config.finalize() method.
Parameters:
- huggingface_config (AutoConfig) – The HuggingFace model configuration object.
- state_dict (dict[str, WeightData]) – The model’s state dictionary containing weights.
- return_logits (ReturnLogits) – Whether to return the last token, all tokens or a variable number of logits.
- return_hidden_states (ReturnHiddenStates) – Whether to return hidden states.
- norm_method (Literal['rms_norm', 'layer_norm']) – The normalization method to use.
- attention_bias (bool) – Whether to include bias in attention projections.
Return type:
None
get_head_dim()
static get_head_dim(huggingface_config)
Return the explicit head_dim from the HuggingFace configuration.
Parameters:
huggingface_config (AutoConfig)
Return type:
int
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initialize the config from a PipelineConfig.
Parameters:
- pipeline_config (PipelineConfig) – The pipeline configuration.
- model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)
Initializes an Olmo2Config instance from pipeline and HuggingFace configuration.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configuration, without needing the state_dict. Fields that depend on the state_dict (like tie_word_embeddings, quant_config) should be set via the finalize() method.
Overrides Llama3Config.initialize_from_config to use Olmo2-specific KV params and attention multiplier calculations.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – The HuggingFace model configuration object.
- model_config (MAXModelConfig | None) – The MAX Engine model configuration.
Returns:
An initialized Olmo2Config instance.
Return type:
Olmo2Config
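The two-phase flow described above — initialize_from_config() for fields known before weights load, finalize() for fields that need the state_dict — can be sketched with stand-in objects. The class body and the tied-embeddings heuristic below are illustrative assumptions, not MAX's implementation:

```python
from types import SimpleNamespace

class TwoPhaseConfig:
    # Hypothetical simplification mirroring the initialize_from_config /
    # finalize split in Olmo2Config.
    @classmethod
    def initialize_from_config(cls, huggingface_config):
        cfg = cls()
        # Known from the HuggingFace config alone, no weights needed.
        cfg.vocab_size = huggingface_config.vocab_size
        cfg.tie_word_embeddings = None  # deferred to finalize()
        return cfg

    def finalize(self, state_dict):
        # Requires the weights: if no separate lm_head weight exists,
        # assume input and output embeddings are tied (illustrative rule).
        self.tie_word_embeddings = "lm_head.weight" not in state_dict

hf_cfg = SimpleNamespace(vocab_size=100352)  # illustrative value
cfg = TwoPhaseConfig.initialize_from_config(hf_cfg)
cfg.finalize({"model.embed_tokens.weight": object()})
print(cfg.tie_word_embeddings)  # True
```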
Olmo2Model
class max.pipelines.architectures.olmo2.Olmo2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)
Bases: LlamaModelBase
OLMo2 pipeline model implementation.
Parameters:
- pipeline_config (PipelineConfig) – The configuration for this pipeline.
- session (InferenceSession) – The container for the runtime for this model.
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
attention_bias
attention_bias: bool = False
Whether to use attention bias.
model
model: Model
Compiled and initialized model ready for inference.
norm_method
norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'
The normalization method to use.
state_dict
Weights to load into the model.