Python module

max.pipelines.architectures.qwen3_embedding

Qwen3 architecture for embedding generation.

Qwen3EmbeddingConfig

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingConfig(*, pipeline_config)

Bases: ArchConfig

Qwen3 embedding model configuration.

Parameters:

pipeline_config (PipelineConfig)

get_max_seq_len()

get_max_seq_len()

Returns the default maximum sequence length for the model.

Subclasses should determine whether this value can be overridden by setting the --max-length (pipeline_config.model.max_length) flag.

Return type:

int

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self

pipeline_config

pipeline_config: PipelineConfig

Qwen3EmbeddingInputs

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingInputs(tokens, input_row_offsets, return_n_logits, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

Input structure for Qwen3 embedding models.

Parameters:

  • tokens (Buffer) – Input token IDs [total_seq_len].
  • input_row_offsets (Buffer) – Row offsets for ragged tensors [batch_size + 1].
  • return_n_logits (Buffer) – Number of logits to return (kept for interface compatibility).
  • kv_cache_inputs – Optional KV cache inputs; defaults to None.
  • lora_ids – Optional LoRA IDs; defaults to None.
  • lora_ranks – Optional LoRA ranks; defaults to None.
  • hidden_states – Optional hidden states; defaults to None.

input_row_offsets

input_row_offsets: Buffer

Row offsets for ragged tensors [batch_size + 1].

return_n_logits

return_n_logits: Buffer

Number of logits to return (kept for interface compatibility).

tokens

tokens: Buffer

Input token IDs [total_seq_len].
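
Together, tokens and input_row_offsets encode a ragged batch: the token IDs of all sequences are concatenated into one flat array, and the offsets mark where each sequence starts and ends. A minimal NumPy sketch of that layout (illustrative values; the real inputs are device Buffers):

```python
import numpy as np

# Three variable-length sequences to be packed into one ragged batch.
seqs = [[101, 7, 8], [101, 9], [101, 5, 6, 2]]

# tokens: all IDs concatenated -> shape [total_seq_len]
tokens = np.concatenate([np.asarray(s) for s in seqs])

# input_row_offsets: cumulative lengths with a leading 0 -> shape [batch_size + 1]
input_row_offsets = np.concatenate([[0], np.cumsum([len(s) for s in seqs])])

# Sequence i occupies tokens[input_row_offsets[i]:input_row_offsets[i+1]].
print(tokens.shape)       # (9,)
print(input_row_offsets)  # [0 3 5 9]
```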

Qwen3EmbeddingModel

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL)

Bases: PipelineModel[TextContext]

Qwen3 embedding pipeline model without KV caching.

This model is optimized for embedding generation with:

  • No KV cache overhead
  • Single-pass forward computation
  • Flash attention without cache operations
  • Last token pooling with L2 normalization
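
The last bullet can be illustrated in plain NumPy: given the flattened hidden states of a ragged batch, last-token pooling selects the final position of each sequence, and L2 normalization scales each pooled vector to unit length. This is an illustrative sketch of the math, not the actual kernel:

```python
import numpy as np

# hidden_states: flattened ragged batch of shape [total_seq_len, hidden_dim]
hidden_dim = 4
hidden_states = np.arange(9 * hidden_dim, dtype=np.float64).reshape(9, hidden_dim)
input_row_offsets = np.array([0, 3, 5, 9])  # three sequences of lengths 3, 2, 4

# Last-token pooling: take the hidden state at the final position of each sequence.
last_idx = input_row_offsets[1:] - 1     # indices 2, 4, 8
pooled = hidden_states[last_idx]         # shape [batch_size, hidden_dim]

# L2 normalization: scale each embedding to unit Euclidean norm.
embeddings = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
print(np.linalg.norm(embeddings, axis=-1))  # ~[1. 1. 1.]
```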

Initialize the Qwen3 embedding pipeline model.

Parameters:

  • pipeline_config – The pipeline configuration.
  • session – Inference session used to compile and run the model.
  • devices – Devices to run the model on.
  • kv_cache_config – KV cache configuration.
  • weights – Model weights.
  • adapter – Optional weights adapter; defaults to None.
  • return_logits – Which logits to return; defaults to ReturnLogits.ALL.

attention_bias

attention_bias: bool = False

Whether to use attention bias.

calculate_max_seq_len()

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

Calculate maximum sequence length.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace configuration

Returns:

Maximum sequence length

Return type:

int
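
One plausible shape for this calculation, consistent with the --max-length override noted under get_max_seq_len() above: use the user-supplied length when given (validating it against the checkpoint's limit), otherwise fall back to the model's maximum. The function and values below are illustrative, not the actual MAX implementation:

```python
# Illustrative only: resolve a model's max sequence length, where a
# user-supplied --max-length overrides the checkpoint default.
def resolve_max_seq_len(user_max_length, hf_max_position_embeddings):
    if user_max_length is not None:
        if user_max_length > hf_max_position_embeddings:
            raise ValueError("--max-length exceeds the model's maximum")
        return user_max_length
    return hf_max_position_embeddings

print(resolve_max_seq_len(None, 32768))  # 32768 (checkpoint default)
print(resolve_max_seq_len(8192, 32768))  # 8192  (user override)
```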

execute()

execute(model_inputs)

Execute the model.

Parameters:

model_inputs (ModelInputs) – Model inputs

Returns:

Model outputs with embeddings in the logits field

Return type:

ModelOutputs

model

model: Model

Compiled and initialized model.

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method.

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare initial inputs for embedding generation.

Parameters:

  • replica_batches – Input batches, one per replica.
  • kv_cache_inputs – Optional KV cache inputs; defaults to None.
  • return_n_logits – Number of logits to return; defaults to 1.

Returns:

Prepared inputs

Return type:

Qwen3EmbeddingInputs

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepare next token inputs (not supported for embedding models).

Parameters:

  • next_tokens (Buffer) – Next tokens
  • prev_model_inputs (ModelInputs) – Previous inputs

Raises:

NotImplementedError – Embedding models don’t support autoregressive generation

Return type:

Qwen3EmbeddingInputs

state_dict

state_dict: dict[str, Any]

Model weights.