
Python module

max.pipelines.architectures.qwen3_embedding

Qwen3 architecture for embeddings generation.

Qwen3EmbeddingConfig

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingConfig(*, pipeline_config)

Bases: ArchConfig

Qwen3 embedding model configuration.

Parameters:

pipeline_config (PipelineConfig)

get_max_seq_len()

get_max_seq_len()

Returns the default maximum sequence length for the model.

Subclasses should determine whether this value can be overridden by setting the --max-length (pipeline_config.model.max_length) flag.

Return type:

int

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self
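
The fallback described for model_config can be sketched as follows. This is a minimal illustration, not the library's implementation: StandInModelConfig, StandInPipelineConfig, and resolve_model_config are hypothetical names invented here, and the real initialize() presumably does more than select a config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInModelConfig:
    """Hypothetical stand-in for MAXModelConfig."""
    name: str


@dataclass
class StandInPipelineConfig:
    """Hypothetical stand-in for PipelineConfig."""
    model: StandInModelConfig
    draft_model: Optional[StandInModelConfig] = None


def resolve_model_config(pipeline_config, model_config=None):
    """Mirror the documented default: use pipeline_config.model when None."""
    return pipeline_config.model if model_config is None else model_config


pipeline = StandInPipelineConfig(
    model=StandInModelConfig("main"),
    draft_model=StandInModelConfig("draft"),
)
main_cfg = resolve_model_config(pipeline)                         # default path
draft_cfg = resolve_model_config(pipeline, pipeline.draft_model)  # explicit config
```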

pipeline_config

pipeline_config: PipelineConfig

Qwen3EmbeddingInputs

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingInputs(tokens, input_row_offsets, return_n_logits, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

Input structure for Qwen3 embedding models.

Parameters:

input_row_offsets

input_row_offsets: Buffer

Row offsets into the ragged token tensor, with shape [batch_size + 1].
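
To make the [batch_size + 1] shape concrete, here is a minimal sketch of how row offsets relate to a flattened batch. It assumes the usual ragged-tensor convention (cumulative sequence lengths with a leading 0) and uses plain Python lists rather than the Buffer type; build_row_offsets is a hypothetical helper, not part of the module.

```python
from itertools import accumulate


def build_row_offsets(seq_lens):
    """Return cumulative offsets with len(seq_lens) + 1 entries."""
    return [0, *accumulate(seq_lens)]


# Three hypothetical tokenized sequences of lengths 3, 1, and 2.
sequences = [[101, 7, 42], [9], [5, 6]]

# Flatten the batch into one list of length total_seq_len.
tokens = [tok for seq in sequences for tok in seq]

row_offsets = build_row_offsets([len(seq) for seq in sequences])

# Sequence i occupies tokens[row_offsets[i]:row_offsets[i + 1]].
second = tokens[row_offsets[1]:row_offsets[2]]
```

The final offset equals the total number of tokens, so a single flat token array plus the offsets is enough to recover every sequence without padding.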

return_n_logits

return_n_logits: Buffer

Number of logits to return (kept for interface compatibility).

tokens

tokens: Buffer

Input token IDs, with shape [total_seq_len].

Qwen3EmbeddingModel

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL)

Bases: PipelineModel[TextContext]

Qwen3 embedding pipeline model without KV caching.

This model is optimized for embedding generation with:

  • No KV cache overhead
  • Single-pass forward computation
  • Flash attention without cache operations
  • Last token pooling with L2 normalization
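
As an illustration of the last item, the following sketch shows last-token pooling followed by L2 normalization on a ragged batch. It is written in plain Python under stated assumptions: hidden states are given as one vector per token, every sequence is non-empty, and the helper names and eps value are hypothetical. The compiled model's actual kernels are not shown in this reference.

```python
import math


def last_token_pool(hidden_states, row_offsets):
    """Select the hidden vector of each sequence's final token."""
    # row_offsets[i + 1] is one past the last token of sequence i.
    return [hidden_states[end - 1] for end in row_offsets[1:]]


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Per-token hidden states for two sequences of lengths 3 and 1.
hidden_states = [
    [1.0, 0.0], [0.0, 2.0], [3.0, 4.0],  # sequence 0
    [0.0, 5.0],                          # sequence 1
]
row_offsets = [0, 3, 4]

embeddings = [
    l2_normalize(vec) for vec in last_token_pool(hidden_states, row_offsets)
]
```

Because each embedding has unit length, the dot product of two embeddings equals their cosine similarity.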

Initialize the Qwen3 embedding pipeline model.

Parameters:

attention_bias

attention_bias: bool = False

Whether to use attention bias.

calculate_max_seq_len()

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

Calculate maximum sequence length.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace configuration

Returns:

Maximum sequence length

Return type:

int

execute()

execute(model_inputs)

Execute the model.

Parameters:

model_inputs (ModelInputs) – Model inputs

Returns:

Model outputs with embeddings in the logits field

Return type:

ModelOutputs

model

model: Model

Compiled and initialized model.

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method.
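
For readers unfamiliar with the two options, the sketch below gives the standard textbook definitions of RMS normalization and layer normalization for a single vector. It is not taken from this module: the function names, the eps value, and the use of plain Python lists are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Rescale x by its root mean square, then apply a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def layer_norm(x, weight, bias, eps=1e-6):
    """Center x, rescale by its standard deviation, then apply weight and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale * w + b for v, w, b in zip(x, weight, bias)]


x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

rms_out = rms_norm(x, ones)
ln_out = layer_norm(x, ones, zeros)
```

The practical difference is that RMS normalization skips mean subtraction and has no bias term, so its output keeps the sign pattern of the input, while layer normalization produces a zero-mean output.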

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare initial inputs for embedding generation.

Parameters:

Returns:

Prepared inputs

Return type:

Qwen3EmbeddingInputs

state_dict

state_dict: dict[str, Any]

Model weights.