Python module

max.pipelines.architectures.qwen3vl_moe

Qwen3-VL vision-language architecture for multimodal text generation.

Qwen3VLConfig​

class max.pipelines.architectures.qwen3vl_moe.Qwen3VLConfig(*, devices, dtype, image_token_id, video_token_id, vision_start_token_id, spatial_merge_size, mrope_section, num_experts, num_experts_per_tok, moe_intermediate_size, mlp_only_layers, norm_topk_prob, decoder_sparse_step, vision_config, llm_config)

source

Bases: ArchConfigWithKVCache

Configuration for Qwen3VL models.

Parameters:

  • devices (list[DeviceRef])
  • dtype (DType)
  • image_token_id (int)
  • video_token_id (int)
  • vision_start_token_id (int)
  • spatial_merge_size (int)
  • mrope_section (list[int])
  • num_experts (int)
  • num_experts_per_tok (int)
  • moe_intermediate_size (int)
  • mlp_only_layers (list[int])
  • norm_topk_prob (bool)
  • decoder_sparse_step (int)
  • vision_config (VisionConfig)
  • llm_config (Llama3Config)

calculate_max_seq_len()​

static calculate_max_seq_len(pipeline_config, huggingface_config)

source

Calculate maximum sequence length for Qwen3VL.

Parameters:

  • pipeline_config (PipelineConfig)
  • huggingface_config (AutoConfig)

Return type:

int

construct_kv_params()​

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

source

Parameters:

  • huggingface_config (AutoConfig)
  • pipeline_config (PipelineConfig)
  • devices (list[DeviceRef])
  • kv_cache_config (KVCacheConfig)
  • cache_dtype (DType)

Return type:

KVCacheParams

decoder_sparse_step​

decoder_sparse_step: int

source

Interval at which decoder layers use MoE blocks; the selection rule, which also involves mlp_only_layers and num_experts, is sketched below.
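
A minimal sketch of the selection rule, following the pattern in the Hugging Face Qwen3-MoE reference implementation (the helper name is illustrative):

```python
def is_moe_layer(
    layer_idx: int,
    decoder_sparse_step: int,
    mlp_only_layers: list[int],
    num_experts: int,
) -> bool:
    # A layer uses MoE unless it is listed as MLP-only, and only
    # every `decoder_sparse_step`-th layer routes through experts.
    return (
        layer_idx not in mlp_only_layers
        and num_experts > 0
        and (layer_idx + 1) % decoder_sparse_step == 0
    )

# With decoder_sparse_step=1 and an empty mlp_only_layers list,
# every decoder layer is an MoE layer.
assert is_moe_layer(0, 1, [], 128)
```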

devices​

devices: list[DeviceRef]

source

Devices that the Qwen3VL model is parallelized over.

dtype​

dtype: DType

source

DType of the Qwen3VL model weights.

finalize()​

finalize(huggingface_config, llm_state_dict, vision_state_dict, return_logits, norm_method='rms_norm')

source

Finalize the Qwen3VLConfig instance with state_dict dependent fields.

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
  • vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
  • return_logits (ReturnLogits) – Return logits configuration.
  • norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.

Return type:

None

get_kv_params()​

get_kv_params()

source

Returns the KV cache parameters from the embedded LLM config.

Return type:

KVCacheParams

get_max_seq_len()​

get_max_seq_len()

source

Returns the maximum sequence length from the embedded LLM config.

Return type:

int

get_num_layers()​

static get_num_layers(huggingface_config)

source

Parameters:

huggingface_config (AutoConfig)

Return type:

int

image_token_id​

image_token_id: int

source

Token ID used for image placeholders in the input sequence.

initialize()​

classmethod initialize(pipeline_config, model_config=None)

source

Initializes a Qwen3VLConfig instance from pipeline configuration.

Parameters:

Returns:

A Qwen3VLConfig instance with fields initialized from config.

Return type:

Self

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config)

source

Initializes a Qwen3VLConfig from pipeline and HuggingFace configs.

This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Returns:

A Qwen3VLConfig instance ready for finalization.

Return type:

Self
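
A minimal sketch of the two-stage construction these methods describe, assuming the pipeline config, HuggingFace config, and weight dictionaries have already been loaded (variable and helper names are illustrative):

```python
from max.pipelines.architectures.qwen3vl_moe import Qwen3VLConfig

def build_qwen3vl_config(
    pipeline_config, huggingface_config,
    llm_state_dict, vision_state_dict, return_logits,
):
    # Stage 1: fields derivable without the weights.
    config = Qwen3VLConfig.initialize_from_config(
        pipeline_config, huggingface_config
    )
    # Stage 2: fields that depend on the loaded state dicts.
    config.finalize(
        huggingface_config=huggingface_config,
        llm_state_dict=llm_state_dict,
        vision_state_dict=vision_state_dict,
        return_logits=return_logits,
    )
    return config
```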

llm_config​

llm_config: Llama3Config

source

Language model configuration using Llama3 architecture.

mlp_only_layers​

mlp_only_layers: list[int]

source

Indices of decoder layers that use a dense MLP instead of an MoE block.

moe_intermediate_size​

moe_intermediate_size: int

source

Intermediate size in the MoE layer.

mrope_section​

mrope_section: list[int]

source

Number of rotary embedding channels assigned to each of the temporal, height, and width axes in multimodal RoPE (M-RoPE).
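
A simplified NumPy sketch of how the sections partition rotary channels among the three axes; the real kernel applies this split to the rotary cos/sin tables, so the numbers here are only illustrative:

```python
import numpy as np

def mrope_position_per_channel(position_ids, mrope_section):
    """position_ids: (3, seq_len), one row per axis (temporal, height, width).
    Returns (seq_len, sum(mrope_section)): the position id driving each
    rotary channel."""
    chunks = []
    for axis, section in enumerate(mrope_section):
        # `section` rotary channels take their positions from this axis.
        chunks.append(np.repeat(position_ids[axis][:, None], section, axis=1))
    return np.concatenate(chunks, axis=1)

# Example: 64 rotary channels split 24/20/20 across t/h/w (illustrative).
pos = np.stack([np.arange(5)] * 3)  # all three axes share positions here
assert mrope_position_per_channel(pos, [24, 20, 20]).shape == (5, 64)
```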

norm_topk_prob​

norm_topk_prob: bool

source

Whether to use top-k probability normalization in the MoE layer.

num_experts​

num_experts: int

source

Number of experts in the MoE layer.

num_experts_per_tok​

num_experts_per_tok: int

source

Number of experts per token in the MoE layer.

spatial_merge_size​

spatial_merge_size: int

source

Size parameter for spatial merging of vision features.

video_token_id​

video_token_id: int

source

Token ID used for video placeholders in the input sequence.

vision_config​

vision_config: VisionConfig

source

Vision encoder configuration.

vision_start_token_id​

vision_start_token_id: int

source

Token ID that marks the start of vision content.

Qwen3VLInputs​

class max.pipelines.architectures.qwen3vl_moe.Qwen3VLInputs(tokens, input_row_offsets, signal_buffers, decoder_position_ids, return_n_logits, image_token_indices=None, pixel_values=None, vision_position_ids=None, weights=None, indices=None, max_grid_size=None, cu_seqlens=None, max_seqlen=None, grid_thw=None, *, kv_cache_inputs, lora_ids=None, lora_ranks=None, hidden_states=None)

source

Bases: ModelInputs

A class representing inputs for the Qwen3VL model.

This class encapsulates the input tensors required for the Qwen3VL model execution, including both text and vision inputs. Vision inputs are optional and can be None for text-only processing.

Parameters:

cu_seqlens​

cu_seqlens: list[Buffer] | None = None

source

Cumulative sequence lengths for full attention per device.

decoder_position_ids​

decoder_position_ids: Buffer

source

3D RoPE position IDs for the decoder.

grid_thw​

grid_thw: list[Buffer] | None = None

source

Grid dimensions (temporal, height, width) for each image/video, shape (n_images, 3) per device.
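
Together with spatial_merge_size, these grid dimensions determine how many placeholder tokens an image or video occupies. A small illustrative calculation:

```python
def num_image_tokens(grid_t: int, grid_h: int, grid_w: int,
                     spatial_merge_size: int) -> int:
    """Each spatial_merge_size x spatial_merge_size block of patches is
    merged into a single token fed to the language model."""
    return grid_t * grid_h * grid_w // (spatial_merge_size ** 2)

# A single image (grid_t=1) of 32x32 patches with 2x2 merging
# occupies 256 placeholder tokens.
assert num_image_tokens(1, 32, 32, 2) == 256
```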

has_vision_inputs​

property has_vision_inputs: bool

source

Check if this input contains vision data.

image_token_indices​

image_token_indices: list[Buffer] | None = None

source

Per-device precomputed multimodal merge indices for the image embeddings.

These are the positions of image_token_id in the token sequence fed to the model. Negative indices are ignored by the multimodal merge.
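
A minimal NumPy sketch of the merge these indices drive, under the assumption (stated above) that negative indices are skipped; this is an illustration, not the MAX kernel:

```python
import numpy as np

def merge_image_embeddings(text_embeds: np.ndarray,
                           image_embeds: np.ndarray,
                           image_token_indices: np.ndarray) -> np.ndarray:
    """Scatter image embeddings into the positions of image placeholder
    tokens. Negative indices mark entries to skip."""
    out = text_embeds.copy()
    valid = image_token_indices >= 0
    out[image_token_indices[valid]] = image_embeds[valid]
    return out

# Two image tokens at positions 1 and 3 of a 5-token sequence.
text = np.zeros((5, 4))
img = np.ones((2, 4))
merged = merge_image_embeddings(text, img, np.array([1, 3]))
assert merged[1].sum() == 4 and merged[0].sum() == 0
```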

indices​

indices: list[Buffer] | None = None

source

Bilinear interpolation indices for vision position embeddings per device.

input_row_offsets​

input_row_offsets: list[Buffer]

source

Per-device tensors containing the offsets for each row in the ragged input sequence.
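
For reference, a ragged batch concatenates all sequences into a single token tensor, and the row offsets record where each sequence begins. A small illustrative example:

```python
import numpy as np

# Two sequences of lengths 5 and 3, flattened into one ragged tensor.
tokens = np.array([11, 12, 13, 14, 15, 21, 22, 23])
input_row_offsets = np.array([0, 5, 8])  # sequence i spans [offsets[i], offsets[i+1])

seq_1 = tokens[input_row_offsets[1]:input_row_offsets[2]]
assert seq_1.tolist() == [21, 22, 23]
```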

max_grid_size​

max_grid_size: list[Buffer] | None = None

source

Maximum grid size for vision inputs per device.

max_seqlen​

max_seqlen: list[Buffer] | None = None

source

Maximum sequence length for full attention for vision inputs per device.

pixel_values​

pixel_values: list[Buffer] | None = None

source

Pixel values for vision inputs.

return_n_logits​

return_n_logits: Buffer

source

Number of logits to return; used, for example, by speculative decoding.

signal_buffers​

signal_buffers: list[Buffer]

source

Device buffers used for synchronization in communication collectives.

tokens​

tokens: Buffer

source

Tensor containing the input token IDs.

vision_position_ids​

vision_position_ids: list[Buffer] | None = None

source

Vision rotary position IDs per device.

weights​

weights: list[Buffer] | None = None

source

Bilinear interpolation weights for vision position embeddings per device.
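
These weights and the indices field above follow the standard gather-plus-weighted-sum formulation of bilinear interpolation: resampling the learned position-embedding grid to the actual patch grid reduces to four gathers and a weighted combination. A generic NumPy sketch of that formulation (not the exact MAX kernel):

```python
import numpy as np

def interpolate_pos_embeddings(pos_embeds: np.ndarray,
                               indices: np.ndarray,
                               weights: np.ndarray) -> np.ndarray:
    """pos_embeds: (num_position_embeddings, dim) learned grid, flattened.
    indices: (num_targets, 4) -- the four neighboring grid entries.
    weights: (num_targets, 4) -- bilinear weights summing to 1 per row."""
    # Gather the four neighbors, then combine: sum_k w_k * emb[idx_k].
    gathered = pos_embeds[indices]           # (num_targets, 4, dim)
    return (weights[..., None] * gathered).sum(axis=1)

emb = np.random.rand(16, 8)                  # a 4x4 grid, flattened
idx = np.array([[0, 1, 4, 5]])
w = np.array([[0.25, 0.25, 0.25, 0.25]])
assert interpolate_pos_embeddings(emb, idx, w).shape == (1, 8)
```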

Qwen3VLModel​

class max.pipelines.architectures.qwen3vl_moe.Qwen3VLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

source

Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[Qwen3VLTextAndVisionContext]

A Qwen3VL pipeline model for multimodal text generation.

Parameters:

calculate_max_seq_len()​

static calculate_max_seq_len(pipeline_config, huggingface_config)

source

Calculates the maximum sequence length for the Qwen3VL model.

Parameters:

  • pipeline_config (PipelineConfig)
  • huggingface_config (AutoConfig)

Return type:

int

estimate_activation_memory()​

classmethod estimate_activation_memory(pipeline_config, huggingface_config)

source

Estimates the activation memory required for model execution.

This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.

The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – Hugging Face model configuration

Returns:

Estimated activation memory in bytes

Return type:

int

execute()​

execute(model_inputs)

source

Executes the Qwen3VL model with the prepared inputs.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

get_kv_params()​

classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

source

Gets the parameters required to configure the KV cache for Qwen3VL.

Parameters:

  • huggingface_config (AutoConfig)
  • pipeline_config (PipelineConfig)
  • devices (list[DeviceRef])
  • kv_cache_config (KVCacheConfig)
  • cache_dtype (DType)

Return type:

KVCacheParams

language_model​

language_model: Model

source

The compiled language model for text generation.

load_model()​

load_model(session)

source

Loads the compiled Qwen3VL models into the MAX Engine session.

Returns:

A tuple of (vision_model, language_model).

Parameters:

session (InferenceSession)

Return type:

tuple[Model, Model]

model_config​

model_config: Qwen3VLConfig | None

source

The Qwen3VL model configuration.

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

source

Prepares the initial inputs for the first execution pass of the Qwen3VL model.

Parameters:

Return type:

Qwen3VLInputs

prepare_next_token_inputs()​

prepare_next_token_inputs(next_tokens, prev_model_inputs)

source

Prepares the inputs for subsequent execution steps in a multi-step generation.

Parameters:

Return type:

Qwen3VLInputs
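
Together with execute(), these two methods form the per-step loop of multi-step generation. A hedged sketch of that control flow, assuming model is a loaded Qwen3VLModel and that replica_batches, kv_cache_inputs, and the sample helper come from the surrounding pipeline (all illustrative):

```python
def generate_steps(model, replica_batches, kv_cache_inputs, num_steps: int):
    """Illustrative multi-step loop: one full prepare on step 0,
    lightweight re-preparation for every step after that."""
    inputs = model.prepare_initial_token_inputs(
        replica_batches, kv_cache_inputs=kv_cache_inputs
    )
    for _ in range(num_steps):
        outputs = model.execute(inputs)
        # Subsequent steps reuse the previous inputs, swapping in
        # the freshly sampled tokens.
        next_tokens = sample(outputs)  # hypothetical sampling helper
        inputs = model.prepare_next_token_inputs(next_tokens, inputs)
```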

vision_model​

vision_model: Model

source

The compiled vision model for processing images.

VisionConfig​

class max.pipelines.architectures.qwen3vl_moe.VisionConfig(dtype, llm_dtype, devices, patch_size, temporal_patch_size, in_channels, hidden_size, num_attention_heads, depth, intermediate_size, out_hidden_size, deepstack_visual_indexes, rms_norm_eps, spatial_merge_size, num_position_embeddings)

source

Bases: object

Base configuration for the Qwen3VL vision encoder with required fields.

Parameters:

  • dtype (DType)
  • llm_dtype (DType)
  • devices (list[DeviceRef])
  • patch_size (int)
  • temporal_patch_size (int)
  • in_channels (int)
  • hidden_size (int)
  • num_attention_heads (int)
  • depth (int)
  • intermediate_size (int)
  • out_hidden_size (int)
  • deepstack_visual_indexes (list[int])
  • rms_norm_eps (float)
  • spatial_merge_size (int)
  • num_position_embeddings (int)

deepstack_visual_indexes​

deepstack_visual_indexes: list[int]

source

Indices of the vision encoder layers whose hidden states are extracted for DeepStack feature fusion with the language model.

depth​

depth: int

source

Number of vision transformer layers.

devices​

devices: list[DeviceRef]

source

Devices that the Qwen3VL vision encoder model is parallelized over.

dtype​

dtype: DType

source

DType of the Qwen3VL vision model weights.

finalize()​

finalize(vision_dtype, llm_dtype)

source

Finalize VisionConfig with state_dict dependent fields.

Parameters:

  • vision_dtype (DType)
  • llm_dtype (DType)

Return type:

None

hidden_size​

hidden_size: int

source

Hidden size of the vision encoder.

in_channels​

in_channels: int

source

Number of input channels to the vision transformer.

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, hf_vision_config)

source

Initialize VisionConfig from HuggingFace vision config.

Note: dtype fields are set to defaults and should be updated via finalize() once the state_dict is available.

Parameters:

Return type:

VisionConfig

intermediate_size​

intermediate_size: int

source

Intermediate size in the vision encoder’s feed-forward layers.

llm_dtype​

llm_dtype: DType

source

DType of the Qwen3VL language model weights.

num_attention_heads​

num_attention_heads: int

source

Number of attention heads in the vision encoder.

num_position_embeddings​

num_position_embeddings: int

source

Number of position embeddings for the vision encoder.

out_hidden_size​

out_hidden_size: int

source

Output hidden size of the vision encoder. Also the hidden size of the language model.

patch_size​

patch_size: int

source

Vision transformer patch size.

rms_norm_eps​

rms_norm_eps: float

source

Epsilon used for RMS normalization.

spatial_merge_size​

spatial_merge_size: int

source

Spatial merge size for the vision encoder.

temporal_patch_size​

temporal_patch_size: int

source

Vision transformer temporal patch size.