Python module

max.pipelines.architectures.qwen3_embedding

Qwen3 architecture for embedding generation.

Qwen3EmbeddingConfig

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingConfig(*, pipeline_config)

Bases: ArchConfig

Qwen3 embedding model configuration.

Parameters:

pipeline_config (PipelineConfig)

get_max_seq_len()

get_max_seq_len()

Returns the default maximum sequence length for the model.

Subclasses should determine whether this value can be overridden by setting the --max-length (pipeline_config.model.max_length) flag.

Return type:

int

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self

pipeline_config

pipeline_config: PipelineConfig

Qwen3EmbeddingInputs

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingInputs(tokens, input_row_offsets, return_n_logits, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

Input structure for Qwen3 embedding models.

Parameters:

  • tokens (Buffer) – Input token IDs [total_seq_len].
  • input_row_offsets (Buffer) – Row offsets for ragged tensors [batch_size + 1].
  • return_n_logits (Buffer) – Number of logits to return (kept for interface compatibility).
  • kv_cache_inputs – Optional KV cache inputs; defaults to None.
  • lora_ids – Optional LoRA IDs; defaults to None.
  • lora_ranks – Optional LoRA ranks; defaults to None.
  • hidden_states – Optional hidden states; defaults to None.

input_row_offsets

input_row_offsets: Buffer

Row offsets for ragged tensors [batch_size + 1].

return_n_logits

return_n_logits: Buffer

Number of logits to return (kept for interface compatibility).

tokens

tokens: Buffer

Input token IDs [total_seq_len].
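
Together, tokens and input_row_offsets encode a ragged batch: the token IDs of all sequences are concatenated into one flat array, and the offsets mark where each sequence starts and ends. A minimal NumPy sketch of that layout (illustrative values; the real inputs are device Buffers):

```python
import numpy as np

# Three variable-length sequences to be packed into one ragged batch.
seqs = [[101, 7, 8], [101, 9], [101, 5, 6, 2]]

# tokens: all IDs concatenated -> shape [total_seq_len]
tokens = np.concatenate([np.asarray(s) for s in seqs])

# input_row_offsets: cumulative lengths with a leading 0 -> shape [batch_size + 1]
input_row_offsets = np.concatenate([[0], np.cumsum([len(s) for s in seqs])])

# Sequence i occupies tokens[input_row_offsets[i]:input_row_offsets[i+1]].
print(tokens.shape)       # (9,)
print(input_row_offsets)  # [0 3 5 9]
```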

Qwen3EmbeddingModel

class max.pipelines.architectures.qwen3_embedding.Qwen3EmbeddingModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL)

Bases: PipelineModel[TextContext]

Qwen3 embedding pipeline model without KV caching.

This model is optimized for embedding generation with:

  • No KV cache overhead
  • Single-pass forward computation
  • Flash attention without cache operations
  • Last token pooling with L2 normalization
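
The last bullet can be illustrated in plain NumPy: given the flattened hidden states of a ragged batch, last-token pooling selects the final position of each sequence, and L2 normalization scales each pooled vector to unit length. This is an illustrative sketch of the math, not the actual kernel:

```python
import numpy as np

# hidden_states: flattened ragged batch of shape [total_seq_len, hidden_dim]
hidden_dim = 4
hidden_states = np.arange(9 * hidden_dim, dtype=np.float64).reshape(9, hidden_dim)
input_row_offsets = np.array([0, 3, 5, 9])  # three sequences of lengths 3, 2, 4

# Last-token pooling: take the hidden state at the final position of each sequence.
last_idx = input_row_offsets[1:] - 1     # indices 2, 4, 8
pooled = hidden_states[last_idx]         # shape [batch_size, hidden_dim]

# L2 normalization: scale each embedding to unit Euclidean norm.
embeddings = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
print(np.linalg.norm(embeddings, axis=-1))  # ~[1. 1. 1.]
```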

Initialize the Qwen3 embedding pipeline model.

Parameters:

  • pipeline_config – The pipeline configuration.
  • session – Inference session used to compile and run the model.
  • devices – Devices to run the model on.
  • kv_cache_config – KV cache configuration.
  • weights – Model weights.
  • adapter – Optional weights adapter; defaults to None.
  • return_logits – Which logits to return; defaults to ReturnLogits.ALL.

attention_bias

attention_bias: bool = False

Whether to use attention bias.

calculate_max_seq_len()

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

Calculate maximum sequence length.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace configuration

Returns:

Maximum sequence length

Return type:

int
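
One plausible shape for this calculation, consistent with the --max-length override noted under get_max_seq_len() above: use the user-supplied length when given (validating it against the checkpoint's limit), otherwise fall back to the model's maximum. The function and values below are illustrative, not the actual MAX implementation:

```python
# Illustrative only: resolve a model's max sequence length, where a
# user-supplied --max-length overrides the checkpoint default.
def resolve_max_seq_len(user_max_length, hf_max_position_embeddings):
    if user_max_length is not None:
        if user_max_length > hf_max_position_embeddings:
            raise ValueError("--max-length exceeds the model's maximum")
        return user_max_length
    return hf_max_position_embeddings

print(resolve_max_seq_len(None, 32768))  # 32768 (checkpoint default)
print(resolve_max_seq_len(8192, 32768))  # 8192  (user override)
```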

execute()

execute(model_inputs)

Execute the model.

Parameters:

model_inputs (ModelInputs) – Model inputs

Returns:

Model outputs with embeddings in the logits field

Return type:

ModelOutputs

model

model: Model

Compiled and initialized model.

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization method.

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare initial inputs for embedding generation.

Parameters:

  • replica_batches – Input batches, one per replica.
  • kv_cache_inputs – Optional KV cache inputs; defaults to None.
  • return_n_logits – Number of logits to return; defaults to 1.

Returns:

Prepared inputs

Return type:

Qwen3EmbeddingInputs

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepare next token inputs (not supported for embedding models).

Parameters:

  • next_tokens (Buffer) – Next tokens
  • prev_model_inputs (ModelInputs) – Previous inputs

Raises:

NotImplementedError – Embedding models don’t support autoregressive generation

Return type:

Qwen3EmbeddingInputs

state_dict

state_dict: dict[str, Any]

Model weights.