Python module

max.pipelines.architectures.qwen2_5vl

Qwen2.5-VL vision-language architecture for multimodal text generation.

`Qwen2_5VLConfig`

class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLConfig(*, devices, image_token_id, video_token_id, vision_start_token_id, spatial_merge_size, tokens_per_second, mrope_section, vision_config, llm_config)

source

Bases: ArchVLConfigWithTextSubconfig, ArchConfigWithKVCache

Configuration for Qwen2.5VL models.

Parameters:

devices (list[DeviceRef])
image_token_id (int)
video_token_id (int)
vision_start_token_id (int)
spatial_merge_size (int)
tokens_per_second (int)
mrope_section (list[int])
vision_config (VisionConfig)
llm_config (Llama3Config)

`devices`

devices: list[DeviceRef]

source

Devices that the Qwen2.5VL model is parallelized over.

`finalize()`

finalize(huggingface_config, pipeline_config, llm_state_dict, vision_state_dict, return_logits, norm_method='rms_norm')

source

Finalizes the Qwen2_5VLConfig instance by setting the fields that depend on the state_dict.

Parameters:

huggingface_config (AutoConfig) – HuggingFace model configuration.
pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
return_logits (ReturnLogits) – Return logits configuration.
norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.

Return type:

None

`get_kv_params()`

get_kv_params()

source

Returns the KV cache parameters from the embedded LLM config.

Return type:

KVCacheParams

`get_num_layers()`

static get_num_layers(huggingface_config)

source

Parameters:

huggingface_config (AutoConfig)

Return type:

int

`image_token_id`

image_token_id: int

source

Token ID used for image placeholders in the input sequence.

`initialize()`

classmethod initialize(pipeline_config, model_config=None)

source

Initializes a Qwen2_5VLConfig instance from pipeline configuration.

Parameters:

pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
model_config (MAXModelConfig | None)

Returns:

A Qwen2_5VLConfig instance with fields initialized from config.

Return type:

Self

`initialize_from_config()`

classmethod initialize_from_config(pipeline_config, huggingface_config)

source

Initializes a Qwen2_5VLConfig from pipeline and HuggingFace configs.

This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.

Parameters:

pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
huggingface_config (AutoConfig) – HuggingFace model configuration.

Returns:

A Qwen2_5VLConfig instance ready for finalization.

Return type:

Self
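
The split between initialize_from_config() and finalize() is a two-phase construction pattern: fields known from configuration are set first, and fields that can only be read from the weights are filled in later. The sketch below illustrates the pattern with a hypothetical SketchConfig class and plain dictionaries; it does not use the MAX API, and none of its names come from the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config that is built in two phases."""

    hidden_size: int             # known from the model configuration up front
    dtype: Optional[str] = None  # only known once the weights are available

    @classmethod
    def initialize_from_config(cls, hf_config: dict) -> "SketchConfig":
        # Phase 1: populate everything that does not depend on the weights.
        return cls(hidden_size=hf_config["hidden_size"])

    def finalize(self, state_dict: dict) -> None:
        # Phase 2: fill in fields that can only be read from the weights.
        first_weight = next(iter(state_dict.values()))
        self.dtype = first_weight["dtype"]


config = SketchConfig.initialize_from_config({"hidden_size": 1280})
dtype_before_finalize = config.dtype  # still None at this point
config.finalize({"embed.weight": {"dtype": "bfloat16"}})
```

Keeping the weight-dependent fields optional until finalization makes it explicit which values are safe to read before the state_dict has been loaded.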

`llm_config`

llm_config: Llama3Config

source

Language model configuration using Llama3 architecture.

`mrope_section`

mrope_section: list[int]

source

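Section sizes for multimodal RoPE (mRoPE): how the rotary embedding dimensions are partitioned among the temporal, height, and width position components.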
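
As an illustration, the sketch below assumes that mrope_section holds three counts, one each for the temporal, height, and width axes, and that each count says how many rotary frequency slots that axis owns. The helper name and the example value [16, 24, 24] are assumptions made for this example, not values read from the MAX source.

```python
def axis_for_each_frequency(mrope_section: list[int]) -> list[str]:
    """Label every rotary frequency slot with the axis that owns it."""
    axes = ["temporal", "height", "width"]
    if len(mrope_section) != len(axes):
        raise ValueError("expected one section size per axis")
    layout: list[str] = []
    for axis, count in zip(axes, mrope_section):
        # Each axis receives a contiguous run of `count` slots.
        layout.extend([axis] * count)
    return layout


# Assumed example value; the three sections cover 16 + 24 + 24 = 64 slots.
layout = axis_for_each_frequency([16, 24, 24])
```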

`spatial_merge_size`

spatial_merge_size: int

source

Size parameter for spatial merging of vision features.
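
To give a sense of scale, the sketch below assumes that spatial merging groups a spatial_merge_size x spatial_merge_size block of neighboring patches into a single vision token, so the token count shrinks by the square of the merge size. The function name and the example image dimensions are hypothetical.

```python
def merged_token_count(
    height: int,
    width: int,
    patch_size: int,
    spatial_merge_size: int,
    temporal_steps: int = 1,
) -> int:
    """Count vision tokens left after merging neighboring patches."""
    grid_h = height // patch_size
    grid_w = width // patch_size
    if grid_h % spatial_merge_size or grid_w % spatial_merge_size:
        raise ValueError("patch grid must be divisible by spatial_merge_size")
    # Each merge_size x merge_size block of patches becomes one token.
    return temporal_steps * grid_h * grid_w // spatial_merge_size**2


# A 448x448 image with 14-pixel patches gives a 32x32 grid of 1024 patches.
count = merged_token_count(448, 448, patch_size=14, spatial_merge_size=2)
```

With a merge size of 2, the 1024 patches in this example reduce to 256 tokens.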

`tokens_per_second`

tokens_per_second: int

source

Number of temporal position steps per second of video, used to scale temporal position IDs for video inputs.

`video_token_id`

video_token_id: int

source

Token ID used for video placeholders in the input sequence.

`vision_config`

vision_config: VisionConfig

source

Vision encoder configuration.

`vision_start_token_id`

vision_start_token_id: int

source

Token ID that marks the start of vision content.

`Qwen2_5VLInputs`

class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLInputs(tokens, input_row_offsets, signal_buffers, position_ids, return_n_logits, image_token_indices=None, pixel_values=None, window_index=None, vision_position_ids=None, max_grid_size=None, cu_seqlens=None, cu_window_seqlens=None, max_seqlen=None, max_window_seqlen=None, *, kv_cache_inputs, lora=None, hidden_states=None)

source

Bases: ModelInputs

A class representing inputs for the Qwen2.5VL model.

This class encapsulates the input tensors required for the Qwen2.5VL model execution, including both text and vision inputs. Vision inputs are optional and can be None for text-only processing.

Parameters:

tokens (Buffer)
input_row_offsets (list[Buffer])
signal_buffers (list[Buffer])
position_ids (Buffer)
return_n_logits (Buffer)
image_token_indices (list[Buffer] | None)
pixel_values (list[Buffer] | None)
window_index (list[Buffer] | None)
vision_position_ids (list[Buffer] | None)
max_grid_size (list[Buffer] | None)
cu_seqlens (list[Buffer] | None)
cu_window_seqlens (list[Buffer] | None)
max_seqlen (list[Buffer] | None)
max_window_seqlen (list[Buffer] | None)
kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer])
lora (LoRAInputs | None)
hidden_states (Buffer | list[Buffer] | None)

`cu_seqlens`

cu_seqlens: list[Buffer] | None = None

source

Cumulative sequence lengths for full attention.
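
Cumulative sequence lengths are commonly stored as prefix sums with a leading zero, so that sequence i occupies the half-open range from entry i to entry i + 1. The sketch below shows that convention with plain Python lists; the exact buffer layout used by MAX is not documented here and is assumed.

```python
from itertools import accumulate


def cumulative_seqlens(lengths: list[int]) -> list[int]:
    """Return prefix sums of the sequence lengths, starting at zero."""
    return [0, *accumulate(lengths)]


# Three packed sequences of lengths 4, 6, and 2.
cu = cumulative_seqlens([4, 6, 2])

# Sequence i spans cu[i]:cu[i + 1] in the packed tensor.
spans = list(zip(cu[:-1], cu[1:]))
```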

`cu_window_seqlens`

cu_window_seqlens: list[Buffer] | None = None

source

Cumulative window sequence lengths for window attention.

`has_vision_inputs`

property has_vision_inputs: bool

source

Check if this input contains vision data.

`image_token_indices`

image_token_indices: list[Buffer] | None = None

source

Per-device pre-computed multimodal merge indices for the image embeddings.

These are the locations of the image_token_id in the inputs fed to the model.

Some indices may be negative, which means that they are ignored by the multimodal merge.
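
The merge semantics described above can be sketched as a scatter that skips negative targets. The function below is a hypothetical illustration using nested Python lists; it assumes that entry i of the index list names the destination row for image embedding i, which may not match the layout MAX uses internally.

```python
def merge_image_embeddings(
    text_embeds: list[list[float]],
    image_embeds: list[list[float]],
    image_token_indices: list[int],
) -> list[list[float]]:
    """Copy image embeddings into the rows named by the indices."""
    merged = [row[:] for row in text_embeds]  # leave the inputs untouched
    for src, dst in enumerate(image_token_indices):
        if dst < 0:
            continue  # negative index: this embedding is ignored
        merged[dst] = image_embeds[src][:]
    return merged


text = [[0.0], [0.0], [0.0], [0.0]]
image = [[1.0], [2.0], [3.0]]
merged = merge_image_embeddings(text, image, [1, 2, -1])
```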

`input_row_offsets`

input_row_offsets: list[Buffer]

source

Per-device tensors containing the offsets for each row in the ragged input sequence.
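
In a ragged batch, all rows are concatenated into one flat token sequence and the offsets mark where each row begins and ends. The sketch below recovers the individual rows from such a layout; the helper name and the list-based representation are assumptions for illustration.

```python
def split_ragged(tokens: list[int], input_row_offsets: list[int]) -> list[list[int]]:
    """Slice a flat token list back into its rows using the offsets."""
    return [
        tokens[start:end]
        for start, end in zip(input_row_offsets[:-1], input_row_offsets[1:])
    ]


# Two requests of lengths 3 and 2 packed into one flat sequence.
rows = split_ragged([11, 12, 13, 21, 22], [0, 3, 5])
```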

`max_grid_size`

max_grid_size: list[Buffer] | None = None

source

Maximum grid size for vision inputs.

`max_seqlen`

max_seqlen: list[Buffer] | None = None

source

Maximum sequence length for full attention for vision inputs.

`max_window_seqlen`

max_window_seqlen: list[Buffer] | None = None

source

Maximum sequence length for window attention for vision inputs.

`pixel_values`

pixel_values: list[Buffer] | None = None

source

Pixel values for vision inputs.

`position_ids`

position_ids: Buffer

source

3D RoPE position IDs for the decoder.
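
In multimodal RoPE, text tokens use the same position value on all three axes (temporal, height, width), which makes the scheme behave like ordinary 1D RoPE for text. The sketch below builds such IDs for a text-only sequence; the [3, seq_len] nested-list layout and the function name are assumptions for this example.

```python
def text_only_position_ids(seq_len: int, start: int = 0) -> list[list[int]]:
    """Build position IDs where all three axes share the same positions."""
    positions = list(range(start, start + seq_len))
    # One row each for the temporal, height, and width axes.
    return [positions[:] for _ in range(3)]


position_ids = text_only_position_ids(4)
```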

`return_n_logits`

return_n_logits: Buffer

source

Number of logits to return; used, for example, by speculative decoding.

`signal_buffers`

signal_buffers: list[Buffer]

source

Device buffers used for synchronization in communication collectives.

`tokens`

tokens: Buffer

source

Tensor containing the input token IDs.

`vision_position_ids`

vision_position_ids: list[Buffer] | None = None

source

1D RoPE position IDs for the visual inputs.

`window_index`

window_index: list[Buffer] | None = None

source

Window indices for vision attention mechanism.

`Qwen2_5VLModel`

class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, max_batch_size=1)

source

Bases: AlwaysSignalBuffersMixin, MultiGraphPipelineModelWithKVCache[TextAndVisionContext]

A Qwen2.5VL pipeline model for multimodal text generation.

Parameters:

pipeline_config (PipelineConfig)
session (InferenceSession)
devices (list[Device])
kv_cache_config (KVCacheConfig)
weights (Weights)
adapter (WeightsAdapter | None)
return_logits (ReturnLogits)
max_batch_size (int)

`batch_processor_cls`

batch_processor_cls

source

alias of Qwen2_5VLBatchProcessor

`execute()`

execute(model_inputs)

source

Executes the Qwen2.5VL model with the prepared inputs.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

`language_model`

language_model: Model

source

The compiled language model for text generation.

`load_model()`

load_model(session)

source

Overrides the base implementation because the tower graph-capture signature is incompatible with it.

Here the _build_* helpers have the signature (module) -> Graph, not the base (config, state_dict, module) -> (Graph, registry), because graphs are captured from a pre-instantiated Qwen2_5VL module after load_state_dict has run in _create_model_config. Registry keys come from the tower splits in _load_state_dict, not from the values returned by per-tower nn.state_dict() calls.

Parameters:

session (InferenceSession)

Return type:

tuple[Model | None, Model]

`model_config`

model_config: Qwen2_5VLConfig | None

source

The Qwen2.5VL model configuration.

`model_config_cls`

model_config_cls

source

alias of Qwen2_5VLConfig

`vision_model`

vision_model: Model | None

source

The compiled vision model for processing images.

`VisionConfig`

class max.pipelines.architectures.qwen2_5vl.VisionConfig(dtype, llm_dtype, devices, patch_size, temporal_patch_size, in_channels, hidden_size, num_attention_heads, depth, intermediate_size, out_hidden_size, fullatt_block_indexes, rms_norm_eps, window_size, spatial_merge_size, quant_config=None)

source

Bases: object

Base configuration for the Qwen2.5VL vision encoder, with required fields.

Parameters:

dtype (DType)
llm_dtype (DType)
devices (list[DeviceRef])
patch_size (int)
temporal_patch_size (int)
in_channels (int)
hidden_size (int)
num_attention_heads (int)
depth (int)
intermediate_size (int)
out_hidden_size (int)
fullatt_block_indexes (list[int])
rms_norm_eps (float)
window_size (int)
spatial_merge_size (int)
quant_config (QuantConfig | None)

`depth`

depth: int

source

Number of vision transformer layers.

`devices`

devices: list[DeviceRef]

source

Devices that the Qwen2.5VL vision encoder model is parallelized over.

`dtype`

dtype: DType

source

DType of the Qwen2.5VL vision model weights.

`finalize()`

finalize(huggingface_config, vision_state_dict, vision_dtype, llm_dtype)

source

Finalizes the VisionConfig by setting the fields that depend on the state_dict.

Parameters:

huggingface_config (AutoConfig)
vision_state_dict (dict[str, WeightData])
vision_dtype (DType)
llm_dtype (DType)

Return type:

None

`fullatt_block_indexes`

fullatt_block_indexes: list[int]

source

Indexes of the full attention blocks in the vision encoder.
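
The sketch below assumes that any block whose index is not listed in fullatt_block_indexes uses windowed attention instead, which is consistent with the window_size field and the separate window sequence-length inputs. The function name and the example values are hypothetical.

```python
def attention_kind_per_block(depth: int, fullatt_block_indexes: list[int]) -> list[str]:
    """Label each vision block as using full or windowed attention."""
    full = set(fullatt_block_indexes)
    return ["full" if index in full else "window" for index in range(depth)]


# Assumed example: 8 blocks, with full attention at blocks 3 and 7.
kinds = attention_kind_per_block(8, [3, 7])
```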

`hidden_size`

hidden_size: int

source

Hidden size of the vision encoder.

`in_channels`

in_channels: int

source

Vision transformer number of input channels.

`initialize_from_config()`

classmethod initialize_from_config(pipeline_config, hf_vision_config)

source

Initialize VisionConfig from HuggingFace vision config.

Note: dtype fields will be set to defaults and should be updated via finalize() once state_dict is available.

Parameters:

pipeline_config (PipelineConfig)
hf_vision_config (AutoConfig)

Return type:

VisionConfig

`intermediate_size`

intermediate_size: int

source

Intermediate size in the vision encoder’s feed-forward layers.

`llm_dtype`

llm_dtype: DType

source

DType of the Qwen2.5VL language model weights.

`num_attention_heads`

num_attention_heads: int

source

Number of attention heads in the vision encoder.

`out_hidden_size`

out_hidden_size: int

source

Output hidden size of the vision encoder. This is also the hidden size of the language model.

`patch_size`

patch_size: int

source

Vision transformer patch size.

`quant_config`

quant_config: QuantConfig | None = None

source

Scaled quantization configuration for the vision encoder.

`rms_norm_eps`

rms_norm_eps: float

source

Epsilon for RMS normalization in the vision encoder.

`spatial_merge_size`

spatial_merge_size: int

source

Spatial merge size for the vision encoder.

`temporal_patch_size`

temporal_patch_size: int

source

Vision transformer temporal patch size.

`window_size`

window_size: int

source

Window size for the vision encoder.
