Python module
max.pipelines.architectures.qwen3vl_moe
Qwen3-VL vision-language architecture for multimodal text generation.
Qwen3VLConfig
class max.pipelines.architectures.qwen3vl_moe.Qwen3VLConfig(*, devices, dtype, image_token_id, video_token_id, vision_start_token_id, spatial_merge_size, mrope_section, num_experts, num_experts_per_tok, moe_intermediate_size, mlp_only_layers, norm_topk_prob, decoder_sparse_step, vision_config, llm_config)
Bases: ArchConfigWithKVCache
Configuration for Qwen3VL models.
Parameters:
- devices (list[DeviceRef])
- dtype (DType)
- image_token_id (int)
- video_token_id (int)
- vision_start_token_id (int)
- spatial_merge_size (int)
- mrope_section (list[int])
- num_experts (int)
- num_experts_per_tok (int)
- moe_intermediate_size (int)
- mlp_only_layers (list[int])
- norm_topk_prob (bool)
- decoder_sparse_step (int)
- vision_config (VisionConfig)
- llm_config (Llama3Config)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculate maximum sequence length for Qwen3VL.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
Return type:
int
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
KVCacheParams
decoder_sparse_step
decoder_sparse_step: int
Interval, in decoder layers, at which sparse MoE blocks are used; layers off this step grid fall back to a dense MLP.
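Together with num_experts and mlp_only_layers (documented below), this step typically selects which decoder layers are sparse. The following sketch mirrors the HuggingFace Qwen MoE convention and is illustrative, not MAX's exact implementation:

def is_moe_layer(layer_idx: int, num_experts: int,
                 decoder_sparse_step: int, mlp_only_layers: list[int]) -> bool:
    # A layer is sparse (MoE) unless it is pinned to a dense MLP or falls
    # off the decoder_sparse_step grid.
    return (
        layer_idx not in mlp_only_layers
        and num_experts > 0
        and (layer_idx + 1) % decoder_sparse_step == 0
    )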
devices
devices: list[DeviceRef]
Devices that the Qwen3VL model is parallelized over.
dtype
dtype: DType
DType of the Qwen3VL model weights.
finalize()
finalize(huggingface_config, llm_state_dict, vision_state_dict, return_logits, norm_method='rms_norm')
Finalize the Qwen3VLConfig instance with state_dict-dependent fields.
Parameters:
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
- vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
- return_logits (ReturnLogits) – Return logits configuration.
- norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.
Return type:
None
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
Return type:
KVCacheParams
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
Return type:
int
get_num_layers()
static get_num_layers(huggingface_config)
Parameters:
huggingface_config (AutoConfig)
Return type:
int
image_token_id
image_token_id: int
Token ID used for image placeholders in the input sequence.
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a Qwen3VLConfig instance from pipeline configuration.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
A Qwen3VLConfig instance with fields initialized from config.
Return type:
Qwen3VLConfig
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config)
Initializes a Qwen3VLConfig from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
Returns:
A Qwen3VLConfig instance ready for finalization.
Return type:
Qwen3VLConfig
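The two-phase pattern can be sketched as follows; this is illustrative, and assumes pipeline_config, huggingface_config, the weight dictionaries, and return_logits are provided by the surrounding pipeline:

from max.pipelines.architectures.qwen3vl_moe import Qwen3VLConfig

# Phase 1: fields derivable from the configs alone.
config = Qwen3VLConfig.initialize_from_config(pipeline_config, huggingface_config)

# Phase 2: fields that depend on the loaded weights.
config.finalize(
    huggingface_config=huggingface_config,
    llm_state_dict=llm_weights,        # dict[str, WeightData], assumed loaded
    vision_state_dict=vision_weights,  # dict[str, WeightData], assumed loaded
    return_logits=return_logits,
    norm_method="rms_norm",
)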
llm_config
llm_config: Llama3Config
Language model configuration using Llama3 architecture.
mlp_only_layers
mlp_only_layers: list[int]
Indices of decoder layers that always use a dense MLP instead of a sparse MoE block.
moe_intermediate_size
moe_intermediate_size: int
Intermediate size in the MoE layer.
mrope_section
mrope_section: list[int]
Split of the rotary embedding dimensions across the temporal, height, and width axes for multimodal rotary position embeddings (M-RoPE).
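As a toy illustration of the split (values assumed, not the model's): each rotary dimension draws its position id from the axis its section covers.

import numpy as np

mrope_section = [2, 1, 1]  # assumed toy split over the (t, h, w) axes
pos_thw = np.array([
    [0, 1, 2, 3],          # temporal position ids
    [0, 0, 1, 1],          # height position ids
    [0, 1, 0, 1],          # width position ids
])                         # shape (3, seq_len)
axis_of_dim = np.repeat([0, 1, 2], mrope_section)  # axis used by each rotary dim
pos_per_dim = pos_thw[axis_of_dim]                 # shape (sum(mrope_section), seq_len)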
norm_topk_prob
norm_topk_prob: bool
Whether to use top-k probability normalization in the MoE layer.
num_experts
num_experts: int
Number of experts in the MoE layer.
num_experts_per_tok
num_experts_per_tok: int
Number of experts per token in the MoE layer.
spatial_merge_size
spatial_merge_size: int
Size parameter for spatial merging of vision features.
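Under the usual interpretation (an assumption here, matching the Qwen-VL family), a spatial_merge_size x spatial_merge_size block of patches is merged into one token:

t, h, w = 1, 32, 32      # toy patch grid (temporal, height, width)
spatial_merge_size = 2   # assumed value
merged_tokens = t * (h // spatial_merge_size) * (w // spatial_merge_size)  # 256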
video_token_id
video_token_id: int
Token ID used for video placeholders in the input sequence.
vision_config
vision_config: VisionConfig
Vision encoder configuration.
vision_start_token_id
vision_start_token_id: int
Token ID that marks the start of vision content.
Qwen3VLInputs
class max.pipelines.architectures.qwen3vl_moe.Qwen3VLInputs(tokens, input_row_offsets, signal_buffers, decoder_position_ids, return_n_logits, image_token_indices=None, pixel_values=None, vision_position_ids=None, weights=None, indices=None, max_grid_size=None, cu_seqlens=None, max_seqlen=None, grid_thw=None, *, kv_cache_inputs, lora_ids=None, lora_ranks=None, hidden_states=None)
Bases: ModelInputs
A class representing inputs for the Qwen3VL model.
This class encapsulates the input tensors required for the Qwen3VL model execution, including both text and vision inputs. Vision inputs are optional and can be None for text-only processing.
Parameters:
- tokens (Buffer)
- input_row_offsets (list[Buffer])
- signal_buffers (list[Buffer])
- decoder_position_ids (Buffer)
- return_n_logits (Buffer)
- image_token_indices (list[Buffer] | None)
- pixel_values (list[Buffer] | None)
- vision_position_ids (list[Buffer] | None)
- weights (list[Buffer] | None)
- indices (list[Buffer] | None)
- max_grid_size (list[Buffer] | None)
- cu_seqlens (list[Buffer] | None)
- max_seqlen (list[Buffer] | None)
- grid_thw (list[Buffer] | None)
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer])
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
cu_seqlens
cu_seqlens: list[Buffer] | None
Cumulative sequence lengths for full attention, per device.
decoder_position_ids
decoder_position_ids: Buffer
3D RoPE position IDs for the decoder.
grid_thw
grid_thw: list[Buffer] | None
Grid dimensions (temporal, height, width) for each image/video, shape (n_images, 3), per device.
has_vision_inputs
property has_vision_inputs: bool
Check if this input contains vision data.
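For example (an illustrative sketch; model_inputs is assumed to be a Qwen3VLInputs built by the pipeline):

def run_step(model_inputs):
    if model_inputs.has_vision_inputs:
        ...  # run the vision encoder on pixel_values, then merge image embeddings
    else:
        ...  # text-only batch: the vision model can be skipped entirely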
image_token_indices
image_token_indices: list[Buffer] | None
Per-device precomputed multimodal merge indices for the image embeddings: the positions of image_token_id in the inputs fed to the model. Negative indices are ignored by the multimodal merge.
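An illustrative sketch of the merge semantics in plain NumPy (not the MAX kernel):

import numpy as np

hidden = np.zeros((8, 4), dtype=np.float32)        # per-token hidden states
image_embeds = np.ones((3, 4), dtype=np.float32)   # one row per image token
merge_indices = np.array([2, -1, 5])               # negative rows are skipped

valid = merge_indices >= 0
hidden[merge_indices[valid]] = image_embeds[valid]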
indices
indices: list[Buffer] | None
Bilinear interpolation indices for vision position embeddings, per device.
input_row_offsets
input_row_offsets: list[Buffer]
Per-device tensors containing the offsets for each row in the ragged input sequence.
max_grid_size
max_grid_size: list[Buffer] | None
Maximum grid size for vision inputs, per device.
max_seqlen
max_seqlen: list[Buffer] | None
Maximum sequence length for full attention over vision inputs, per device.
pixel_values
pixel_values: list[Buffer] | None
Pixel values for vision inputs, per device.
return_n_logits
return_n_logits: Buffer
Number of logits to return; used, for example, by speculative decoding.
signal_buffers
signal_buffers: list[Buffer]
Device buffers used for synchronization in communication collectives.
tokens
tokens: Buffer
Tensor containing the input token IDs.
vision_position_ids
vision_position_ids: list[Buffer] | None
Vision rotary position IDs, per device.
weights
weights: list[Buffer] | None
Bilinear interpolation weights for vision position embeddings, per device.
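A generic sketch of how such indices and weights might combine (toy NumPy with four neighbors per query; not the MAX kernel):

import numpy as np

pos_embed = np.random.rand(16, 8).astype(np.float32)   # (num_positions, dim) table
idx = np.array([[0, 1, 4, 5]])                         # neighbor ids per query
w = np.array([[0.25, 0.25, 0.25, 0.25]], np.float32)   # matching blend weights
interp = (pos_embed[idx] * w[..., None]).sum(axis=1)   # (num_queries, dim)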
Qwen3VLModel
class max.pipelines.architectures.qwen3vl_moe.Qwen3VLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[Qwen3VLTextAndVisionContext]
A Qwen3VL pipeline model for multimodal text generation.
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the Qwen3VL model.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
Return type:
int
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.
Parameters:
- pipeline_config (PipelineConfig) – Pipeline configuration.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
Returns:
Estimated activation memory in bytes.
Return type:
int
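A subclass with significant activation costs might override it along these lines; the accounting and the hidden_size attribute access are illustrative assumptions, not MAX's actual estimate:

from max.pipelines.architectures.qwen3vl_moe import Qwen3VLModel

class MyQwen3VLModel(Qwen3VLModel):  # hypothetical subclass
    @classmethod
    def estimate_activation_memory(cls, pipeline_config, huggingface_config):
        # Assume two transient fp32 working buffers of [token_budget, hidden_size].
        hidden_size = huggingface_config.hidden_size  # assumed HF config attribute
        token_budget = 4096                           # placeholder working-set size
        return 2 * token_budget * hidden_size * 4     # bytes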
execute()
execute(model_inputs)
Executes the Qwen3VL model with the prepared inputs.
Parameters:
model_inputs (ModelInputs)
Return type:
ModelOutputs
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Gets the parameters required to configure the KV cache for Qwen3VL.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
KVCacheParams
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Loads the compiled Qwen3VL models into the MAX Engine session.
Parameters:
session (InferenceSession)
Returns:
A tuple of (vision_model, language_model).
Return type:
tuple[Model, Model]
model_config
model_config: Qwen3VLConfig | None
The Qwen3VL model configuration.
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the Qwen3VL model.
Parameters:
- replica_batches
- kv_cache_inputs
- return_n_logits
Return type:
Qwen3VLInputs
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the inputs for subsequent execution steps in a multi-step generation.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
Return type:
Qwen3VLInputs
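Taken together with prepare_initial_token_inputs() and execute(), a multi-step decode can be sketched as follows (illustrative; the sample() helper and all arguments are assumed to come from the surrounding pipeline):

def decode(model, replica_batches, kv_inputs, num_steps, sample):
    inputs = model.prepare_initial_token_inputs(replica_batches, kv_cache_inputs=kv_inputs)
    outputs = model.execute(inputs)
    for _ in range(num_steps - 1):
        next_tokens = sample(outputs)  # hypothetical next-token selection
        inputs = model.prepare_next_token_inputs(next_tokens, inputs)
        outputs = model.execute(inputs)
    return outputs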
vision_model
vision_model: Model
The compiled vision model for processing images.
VisionConfig
class max.pipelines.architectures.qwen3vl_moe.VisionConfig(dtype, llm_dtype, devices, patch_size, temporal_patch_size, in_channels, hidden_size, num_attention_heads, depth, intermediate_size, out_hidden_size, deepstack_visual_indexes, rms_norm_eps, spatial_merge_size, num_position_embeddings)
Bases: object
Base configuration for the Qwen3VL vision encoder, with required fields.
Parameters:
- dtype (DType)
- llm_dtype (DType)
- devices (list[DeviceRef])
- patch_size (int)
- temporal_patch_size (int)
- in_channels (int)
- hidden_size (int)
- num_attention_heads (int)
- depth (int)
- intermediate_size (int)
- out_hidden_size (int)
- deepstack_visual_indexes (list[int])
- rms_norm_eps (float)
- spatial_merge_size (int)
- num_position_embeddings (int)
deepstack_visual_indexes
deepstack_visual_indexes: list[int]
Indices of the full-attention blocks in the vision encoder.
depth
depth: int
Number of vision transformer layers.
devices
devices: list[DeviceRef]
Devices that the Qwen3VL vision encoder is parallelized over.
dtype
dtype: DType
DType of the Qwen3VL vision model weights.
finalize()
finalize(vision_dtype, llm_dtype)
Finalize the VisionConfig instance with state_dict-dependent fields.
hidden_size
hidden_size: int
Hidden size of the vision encoder.
in_channels
in_channels: int
Number of input channels for the vision transformer.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, hf_vision_config)
Initialize VisionConfig from a HuggingFace vision config.
Note: dtype fields are set to defaults and should be updated via finalize() once the state_dict is available.
Parameters:
- pipeline_config (PipelineConfig)
- hf_vision_config (AutoConfig)
Return type:
VisionConfig
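Per the note above, dtype fields are filled in later; an illustrative two-step flow (argument values assumed to be supplied by the caller):

from max.pipelines.architectures.qwen3vl_moe import VisionConfig

def build_vision_config(pipeline_config, hf_vision_config, vision_dtype, llm_dtype):
    vision_config = VisionConfig.initialize_from_config(pipeline_config, hf_vision_config)
    # ... weights are loaded here; dtypes become known from the state_dict ...
    vision_config.finalize(vision_dtype=vision_dtype, llm_dtype=llm_dtype)
    return vision_config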
intermediate_size
intermediate_size: int
Intermediate size in the vision encoder's feed-forward layers.
llm_dtype
llm_dtype: DType
DType of the Qwen3VL language model weights.
num_attention_heads
num_attention_heads: int
Number of attention heads in the vision encoder.
num_position_embeddings
num_position_embeddings: int
Number of position embeddings for the vision encoder.
out_hidden_size
out_hidden_size: int
Output hidden size of the vision encoder. Also the hidden size of the language model.
patch_size
patch_size: int
Vision transformer patch size.
rms_norm_eps
rms_norm_eps: float
Epsilon for RMS normalization in the vision encoder.
spatial_merge_size
spatial_merge_size: int
Spatial merge size for the vision encoder.
temporal_patch_size
temporal_patch_size: int
Vision transformer temporal patch size.