Python module

max.pipelines.architectures.qwen2

Qwen2 transformer architecture for text generation.

Qwen2Config

class max.pipelines.architectures.qwen2.Qwen2Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, use_subgraphs=True, data_parallel_degree=1)

Bases: Llama3Config

Model configuration for Qwen2 graph construction/execution.

finalize()

finalize(huggingface_config, state_dict, return_logits, return_hidden_states=ReturnHiddenStates.NONE, norm_method='rms_norm', attention_bias=False)

Define parameters that can’t be determined just from the pipeline config.

Return type:

None

Qwen2Model

class max.pipelines.architectures.qwen2.Qwen2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: Llama3Model

Qwen2 pipeline model implementation.
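Since `Qwen2Model` derives from `Llama3Model`, it inherits the Llama 3 graph construction and overrides only Qwen2-specific defaults. The inheritance pattern can be sketched as follows (class bodies here are illustrative placeholders, not the MAX source):

```python
class Llama3ModelSketch:
    # Llama 3 attention projections carry no bias term.
    attention_bias: bool = False

    def describe(self) -> str:
        return f"attention_bias={self.attention_bias}"


class Qwen2ModelSketch(Llama3ModelSketch):
    # Qwen2 overrides the inherited default: its checkpoints include
    # bias tensors for the q/k/v projections.
    attention_bias: bool = True
```

Everything not overridden (here, `describe`) is reused from the base class unchanged.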

attention_bias

attention_bias: bool = True

Whether to use attention bias.
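Numerically, `attention_bias=True` means the attention input projections add a learned bias vector after the matrix multiply. A minimal sketch with NumPy (shapes and names are illustrative):

```python
import numpy as np

hidden_size, seq_len = 8, 4
rng = np.random.default_rng(0)

x = rng.standard_normal((seq_len, hidden_size))      # input hidden states
w_q = rng.standard_normal((hidden_size, hidden_size))  # query weight
b_q = rng.standard_normal(hidden_size)               # present when attention_bias=True

# With attention_bias=True the query projection adds the bias;
# with attention_bias=False the `+ b_q` term is simply dropped.
q = x @ w_q + b_q
```

The same applies to the key and value projections; the output projection is typically bias-free either way.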