Python module
max.pipelines.architectures.qwen2_5vl
Qwen2.5-VL vision-language architecture for multimodal text generation.
Qwen2_5VLConfig
class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLConfig(*, devices, image_token_id, video_token_id, vision_start_token_id, spatial_merge_size, tokens_per_second, mrope_section, vision_config, llm_config)
Bases: ArchConfigWithKVCache
Configuration for Qwen2.5-VL models.
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculate the maximum sequence length for Qwen2.5-VL.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
devices
Devices that the Qwen2.5-VL model is parallelized over.
finalize()
finalize(huggingface_config, pipeline_config, llm_state_dict, vision_state_dict, return_logits, norm_method='rms_norm')
Finalize the Qwen2_5VLConfig instance with state_dict-dependent fields.
Parameters:
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
- vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
- return_logits (ReturnLogits) – Return logits configuration.
- norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.
Return type:
None
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
get_num_layers()
static get_num_layers(huggingface_config)
Parameters:
- huggingface_config (AutoConfig)
image_token_id
image_token_id: int
Token ID used for image placeholders in the input sequence.
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a Qwen2_5VLConfig instance from the pipeline configuration.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
A Qwen2_5VLConfig instance with fields initialized from the config.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config)
Initializes a Qwen2_5VLConfig from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
Returns:
A Qwen2_5VLConfig instance ready for finalization.
llm_config
llm_config: Llama3Config
Language model configuration using the Llama3 architecture.
mrope_section
List of indices for the M-RoPE (multimodal RoPE) sections.
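M-RoPE splits the rotary dimensions across the temporal, height, and width position axes of a 3D position ID. As a rough sketch (the helper name, and the interpretation of each entry as a per-axis dimension count as in HuggingFace Qwen2.5-VL configs, are assumptions rather than this module's API):

```python
# Hypothetical sketch: interpret each entry of mrope_section as the number of
# rotary dimensions assigned to one position axis, in the order
# (temporal, height, width), as in HuggingFace Qwen2.5-VL configs.
def mrope_dim_axes(mrope_section):
    """Return, for each rotary dim, the position axis (0=t, 1=h, 2=w) it follows."""
    axes = []
    for axis, width in enumerate(mrope_section):
        axes.extend([axis] * width)
    return axes

# With mrope_section=[2, 3, 3], dims 0-1 follow the temporal position,
# dims 2-4 the height position, and dims 5-7 the width position.
```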
spatial_merge_size
spatial_merge_size: int
Size parameter for spatial merging of vision features.
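To illustrate the idea, a toy sketch of merging each spatial_merge_size × spatial_merge_size neighborhood of patch features into a single feature vector (the function and the nested-list data layout are illustrative, not the MAX implementation):

```python
# Toy sketch of spatial merging: each merge_size x merge_size neighborhood of
# patch features is concatenated into one feature vector, shrinking the patch
# grid by merge_size in each spatial dimension.
def spatial_merge(grid, merge_size):
    h, w = len(grid), len(grid[0])
    merged = []
    for i in range(0, h, merge_size):
        for j in range(0, w, merge_size):
            block = []
            for di in range(merge_size):
                for dj in range(merge_size):
                    block.extend(grid[i + di][j + dj])
            merged.append(block)
    return merged
```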
tokens_per_second
tokens_per_second: int
Number of tokens per second, used for temporal position IDs of video inputs.
video_token_id
video_token_id: int
Token ID used for video placeholders in the input sequence.
vision_config
vision_config: VisionConfig
Vision encoder configuration.
vision_start_token_id
vision_start_token_id: int
Token ID that marks the start of vision content.
Qwen2_5VLInputs
class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLInputs(tokens, input_row_offsets, signal_buffers, position_ids, return_n_logits, image_token_indices=None, pixel_values=None, window_index=None, vision_position_ids=None, max_grid_size=None, cu_seqlens=None, cu_window_seqlens=None, max_seqlen=None, max_window_seqlen=None, *, kv_cache_inputs, lora_ids=None, lora_ranks=None, hidden_states=None)
Bases: ModelInputs
A class representing the inputs for the Qwen2.5-VL model.
This class encapsulates the input tensors required for Qwen2.5-VL model execution, including both text and vision inputs. Vision inputs are optional and can be None for text-only processing.
Parameters:
- tokens (Buffer)
- input_row_offsets (list[Buffer])
- signal_buffers (list[Buffer])
- position_ids (Buffer)
- return_n_logits (Buffer)
- image_token_indices (list[Buffer] | None)
- pixel_values (list[Buffer] | None)
- window_index (list[Buffer] | None)
- vision_position_ids (list[Buffer] | None)
- max_grid_size (list[Buffer] | None)
- cu_seqlens (list[Buffer] | None)
- cu_window_seqlens (list[Buffer] | None)
- max_seqlen (list[Buffer] | None)
- max_window_seqlen (list[Buffer] | None)
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer])
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
cu_seqlens
Cumulative sequence lengths for full attention.
cu_window_seqlens
Cumulative sequence lengths for window attention.
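Cumulative sequence lengths are the usual prefix-sum boundaries consumed by variable-length attention kernels. A minimal sketch of how they are typically built (the helper below is illustrative, not part of this module):

```python
# Sketch: an exclusive prefix sum over per-sequence (or per-window) token
# counts, so that sequence i occupies the half-open range [cu[i], cu[i + 1])
# and cu[-1] is the total token count.
def cumulative_seqlens(seqlens):
    cu = [0]
    for n in seqlens:
        cu.append(cu[-1] + n)
    return cu
```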
has_vision_inputs
property has_vision_inputs: bool
Check whether this input contains vision data.
image_token_indices
Per-device pre-computed multimodal merge indices for the image embeddings.
These are the locations of the image_token_id in the inputs fed to the model.
Some indices may be negative, which means they are ignored by the multimodal merge.
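The merge described above can be sketched in plain Python; negative indices are skipped, mirroring the "ignored by the multimodal merge" behavior (the helper is a hypothetical illustration, not the actual implementation):

```python
# Illustrative helper: scatter image embeddings into the token embedding
# sequence at the pre-computed indices, skipping negative indices.
def merge_image_embeddings(text_embeds, image_embeds, indices):
    out = list(text_embeds)
    for emb, idx in zip(image_embeds, indices):
        if idx >= 0:
            out[idx] = emb
    return out
```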
input_row_offsets
Per-device tensors containing the offsets for each row in the ragged input sequence.
max_grid_size
Maximum grid size for vision inputs.
max_seqlen
Maximum sequence length for full attention for vision inputs.
max_window_seqlen
Maximum sequence length for window attention for vision inputs.
pixel_values
Pixel values for vision inputs.
position_ids
position_ids: Buffer
3D RoPE position IDs for the decoder.
return_n_logits
return_n_logits: Buffer
Number of logits to return, used by speculative decoding, for example.
signal_buffers
Device buffers used for synchronization in communication collectives.
tokens
tokens: Buffer
Tensor containing the input token IDs.
vision_position_ids
1D RoPE position IDs for the visual inputs.
window_index
Window indices for the vision attention mechanism.
Qwen2_5VLModel
class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]
A Qwen2.5-VL pipeline model for multimodal text generation.
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the Qwen2.5-VL model.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.
Parameters:
- pipeline_config (PipelineConfig) – Pipeline configuration.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
Returns:
Estimated activation memory in bytes.
execute()
execute(model_inputs)
Executes the Qwen2.5-VL model with the prepared inputs.
Parameters:
- model_inputs (ModelInputs)
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Gets the parameters required to configure the KV cache for Qwen2.5-VL.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Loads the compiled Qwen2.5-VL models into the MAX Engine session.
Parameters:
- session (InferenceSession)
Returns:
A tuple of (vision_model, language_model).
model_config
model_config: Qwen2_5VLConfig | None
The Qwen2.5-VL model configuration.
prepare_decoder_position_ids()
static prepare_decoder_position_ids(context_batch, devices)
Prepare decoder position IDs for a batch of contexts.
This function computes position IDs for decoder tokens, handling three cases:
- Vision encoding with pre-computed position IDs (use the stored values)
- Vision encoding requiring recomputation (after preemption)
- Text-only generation (a simple arange with an offset)
The implementation pre-allocates the output array and writes into it directly, avoiding concatenation overhead.
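The text-only case can be sketched as follows: all three RoPE axes (temporal, height, width) share the same arange starting at the current sequence offset. The helper name and layout are hypothetical, simplified from the behavior described above:

```python
# Simplified illustration of the text-only case: every RoPE axis gets the
# same consecutive positions, offset by the tokens already generated.
def text_only_position_ids(start, num_tokens):
    row = list(range(start, start + num_tokens))
    return [row[:], row[:], row[:]]  # shape (3, num_tokens)
```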
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the Qwen2.5-VL model.
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the inputs for subsequent execution steps in multi-step generation.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
vision_model
vision_model: Model
The compiled vision model for processing images.
VisionConfig
class max.pipelines.architectures.qwen2_5vl.VisionConfig(dtype, llm_dtype, devices, patch_size, temporal_patch_size, in_channels, hidden_size, num_attention_heads, depth, intermediate_size, out_hidden_size, fullatt_block_indexes, rms_norm_eps, window_size, spatial_merge_size, quant_config=None)
Bases: object
Base configuration for Qwen2.5-VL vision models with required fields.
Parameters:
- dtype (DType)
- llm_dtype (DType)
- devices (list[DeviceRef])
- patch_size (int)
- temporal_patch_size (int)
- in_channels (int)
- hidden_size (int)
- num_attention_heads (int)
- depth (int)
- intermediate_size (int)
- out_hidden_size (int)
- fullatt_block_indexes (list[int])
- rms_norm_eps (float)
- window_size (int)
- spatial_merge_size (int)
- quant_config (QuantConfig | None)
depth
depth: int
Number of vision transformer layers.
devices
Devices that the Qwen2.5-VL vision encoder model is parallelized over.
dtype
dtype: DType
DType of the Qwen2.5-VL vision model weights.
finalize()
finalize(huggingface_config, vision_state_dict, vision_dtype, llm_dtype)
Finalize the VisionConfig with state_dict-dependent fields.
Parameters:
- huggingface_config (AutoConfig)
- vision_state_dict (dict[str, WeightData])
- vision_dtype (DType)
- llm_dtype (DType)
Return type:
None
fullatt_block_indexes
Indices of the full-attention blocks in the vision encoder.
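A small sketch of how such an index list can select the attention kind per vision layer, with every non-listed layer falling back to windowed attention (the helper is hypothetical, not this module's API):

```python
# Hypothetical helper: layers whose index appears in fullatt_block_indexes
# use full attention; all other vision layers use windowed attention.
def attention_kind_per_layer(depth, fullatt_block_indexes):
    full = set(fullatt_block_indexes)
    return ["full" if i in full else "window" for i in range(depth)]
```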
hidden_size
hidden_size: int
Hidden size of the vision encoder.
in_channels
in_channels: int
Number of input channels to the vision transformer.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, hf_vision_config)
Initialize a VisionConfig from the HuggingFace vision config.
Note: dtype fields will be set to defaults and should be updated via finalize() once the state_dict is available.
Parameters:
- pipeline_config (PipelineConfig)
- hf_vision_config (AutoConfig)
intermediate_size
intermediate_size: int
Intermediate size of the vision encoder's feed-forward layers.
llm_dtype
llm_dtype: DType
DType of the Qwen2.5-VL language model weights.
num_attention_heads
num_attention_heads: int
Number of attention heads in the vision encoder.
out_hidden_size
out_hidden_size: int
Output hidden size of the vision encoder. Also the hidden size of the language model.
patch_size
patch_size: int
Vision transformer patch size.
quant_config
quant_config: QuantConfig | None = None
Scaled quantization configuration for the vision encoder.
rms_norm_eps
rms_norm_eps: float
Epsilon for RMS normalization.
spatial_merge_size
spatial_merge_size: int
Spatial merge size for the vision encoder.
temporal_patch_size
temporal_patch_size: int
Vision transformer temporal patch size.
window_size
window_size: int
Window size for the vision encoder.