Python module
max.pipelines.architectures.qwen2_5vl
Qwen2.5-VL vision-language architecture for multimodal text generation.
Qwen2_5VLConfig
class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLConfig(*, devices, image_token_id, video_token_id, vision_start_token_id, spatial_merge_size, tokens_per_second, mrope_section, vision_config, llm_config)
Bases: ArchConfigWithKVCache
Configuration for Qwen2.5-VL models.
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculate the maximum sequence length for Qwen2.5-VL.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
devices
Devices that the Qwen2.5-VL model is parallelized over.
finalize()
finalize(huggingface_config, pipeline_config, llm_state_dict, vision_state_dict, return_logits, norm_method='rms_norm')
Finalize the Qwen2_5VLConfig instance with state_dict-dependent fields.
Parameters:
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
- vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
- return_logits (ReturnLogits) – Return logits configuration.
- norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.
Return type:
None
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
get_num_layers()
static get_num_layers(huggingface_config)
Parameters:
- huggingface_config (AutoConfig)
image_token_id
image_token_id: int
Token ID used for image placeholders in the input sequence.
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a Qwen2_5VLConfig instance from the pipeline configuration.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
A Qwen2_5VLConfig instance with fields initialized from the config.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config)
Initializes a Qwen2_5VLConfig from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
Returns:
A Qwen2_5VLConfig instance ready for finalization.
llm_config
llm_config: Llama3Config
Language model configuration using the Llama3 architecture.
mrope_section
List of indices for the M-RoPE (multimodal RoPE) sections.
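M-RoPE splits the rotary dimensions across the temporal, height, and width position axes of a 3D position ID. As a rough sketch (the helper name, and the interpretation of each entry as a per-axis dimension count as in HuggingFace Qwen2.5-VL configs, are assumptions rather than this module's API):

```python
# Hypothetical sketch: interpret each entry of mrope_section as the number of
# rotary dimensions assigned to one position axis, in the order
# (temporal, height, width), as in HuggingFace Qwen2.5-VL configs.
def mrope_dim_axes(mrope_section):
    """Return, for each rotary dim, the position axis (0=t, 1=h, 2=w) it follows."""
    axes = []
    for axis, width in enumerate(mrope_section):
        axes.extend([axis] * width)
    return axes

# With mrope_section=[2, 3, 3], dims 0-1 follow the temporal position,
# dims 2-4 the height position, and dims 5-7 the width position.
```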
spatial_merge_size
spatial_merge_size: int
Size parameter for spatial merging of vision features.
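To illustrate the idea, a toy sketch of merging each spatial_merge_size × spatial_merge_size neighborhood of patch features into a single feature vector (the function and the nested-list data layout are illustrative, not the MAX implementation):

```python
# Toy sketch of spatial merging: each merge_size x merge_size neighborhood of
# patch features is concatenated into one feature vector, shrinking the patch
# grid by merge_size in each spatial dimension.
def spatial_merge(grid, merge_size):
    h, w = len(grid), len(grid[0])
    merged = []
    for i in range(0, h, merge_size):
        for j in range(0, w, merge_size):
            block = []
            for di in range(merge_size):
                for dj in range(merge_size):
                    block.extend(grid[i + di][j + dj])
            merged.append(block)
    return merged
```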
tokens_per_second
tokens_per_second: int
Number of tokens per second, used for temporal position IDs of video inputs.
video_token_id
video_token_id: int
Token ID used for video placeholders in the input sequence.
vision_config
vision_config: VisionConfig
Vision encoder configuration.
vision_start_token_id
vision_start_token_id: int
Token ID that marks the start of vision content.
Qwen2_5VLInputs
class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLInputs(tokens, input_row_offsets, signal_buffers, position_ids, return_n_logits, image_token_indices=None, pixel_values=None, window_index=None, vision_position_ids=None, max_grid_size=None, cu_seqlens=None, cu_window_seqlens=None, max_seqlen=None, max_window_seqlen=None, *, kv_cache_inputs, lora_ids=None, lora_ranks=None, hidden_states=None)
Bases: ModelInputs
A class representing the inputs for the Qwen2.5-VL model.
This class encapsulates the input tensors required for Qwen2.5-VL model execution, including both text and vision inputs. Vision inputs are optional and can be None for text-only processing.
Parameters:
- tokens (Buffer)
- input_row_offsets (list[Buffer])
- signal_buffers (list[Buffer])
- position_ids (Buffer)
- return_n_logits (Buffer)
- image_token_indices (list[Buffer] | None)
- pixel_values (list[Buffer] | None)
- window_index (list[Buffer] | None)
- vision_position_ids (list[Buffer] | None)
- max_grid_size (list[Buffer] | None)
- cu_seqlens (list[Buffer] | None)
- cu_window_seqlens (list[Buffer] | None)
- max_seqlen (list[Buffer] | None)
- max_window_seqlen (list[Buffer] | None)
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer])
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
cu_seqlens
Cumulative sequence lengths for full attention.
cu_window_seqlens
Cumulative sequence lengths for window attention.
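Cumulative sequence lengths are the usual prefix-sum boundaries consumed by variable-length attention kernels. A minimal sketch of how they are typically built (the helper below is illustrative, not part of this module):

```python
# Sketch: an exclusive prefix sum over per-sequence (or per-window) token
# counts, so that sequence i occupies the half-open range [cu[i], cu[i + 1])
# and cu[-1] is the total token count.
def cumulative_seqlens(seqlens):
    cu = [0]
    for n in seqlens:
        cu.append(cu[-1] + n)
    return cu
```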
has_vision_inputs
property has_vision_inputs: bool
Check whether this input contains vision data.
image_token_indices
Per-device pre-computed multimodal merge indices for the image embeddings.
These are the locations of the image_token_id in the inputs fed to the model.
Some indices may be negative, which means they are ignored by the multimodal merge.
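The merge described above can be sketched in plain Python; negative indices are skipped, mirroring the "ignored by the multimodal merge" behavior (the helper is a hypothetical illustration, not the actual implementation):

```python
# Illustrative helper: scatter image embeddings into the token embedding
# sequence at the pre-computed indices, skipping negative indices.
def merge_image_embeddings(text_embeds, image_embeds, indices):
    out = list(text_embeds)
    for emb, idx in zip(image_embeds, indices):
        if idx >= 0:
            out[idx] = emb
    return out
```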
input_row_offsets
Per-device tensors containing the offsets for each row in the ragged input sequence.
max_grid_size
Maximum grid size for vision inputs.
max_seqlen
Maximum sequence length for full attention for vision inputs.
max_window_seqlen
Maximum sequence length for window attention for vision inputs.
pixel_values
Pixel values for vision inputs.
position_ids
position_ids: Buffer
3D RoPE position IDs for the decoder.
return_n_logits
return_n_logits: Buffer
Number of logits to return, used by speculative decoding, for example.
signal_buffers
Device buffers used for synchronization in communication collectives.
tokens
tokens: Buffer
Tensor containing the input token IDs.
vision_position_ids
1D RoPE position IDs for the visual inputs.
window_index
Window indices for the vision attention mechanism.
Qwen2_5VLModel
class max.pipelines.architectures.qwen2_5vl.Qwen2_5VLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]
A Qwen2.5-VL pipeline model for multimodal text generation.
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the Qwen2.5-VL model.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.
Parameters:
- pipeline_config (PipelineConfig) – Pipeline configuration.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
Returns:
Estimated activation memory in bytes.
execute()
execute(model_inputs)
Executes the Qwen2.5-VL model with the prepared inputs.
Parameters:
- model_inputs (ModelInputs)
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Gets the parameters required to configure the KV cache for Qwen2.5-VL.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Loads the compiled Qwen2.5-VL models into the MAX Engine session.
Parameters:
- session (InferenceSession)
Returns:
A tuple of (vision_model, language_model).
model_config
model_config: Qwen2_5VLConfig | None
The Qwen2.5-VL model configuration.
prepare_decoder_position_ids()
static prepare_decoder_position_ids(context_batch, devices)
Prepare decoder position IDs for a batch of contexts.
This function computes position IDs for decoder tokens, handling three cases:
- Vision encoding with pre-computed position IDs (use the stored values)
- Vision encoding requiring recomputation (after preemption)
- Text-only generation (a simple arange with an offset)
The implementation pre-allocates the output array and writes into it directly, avoiding concatenation overhead.
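The text-only case can be sketched as follows: all three RoPE axes (temporal, height, width) share the same arange starting at the current sequence offset. The helper name and layout are hypothetical, simplified from the behavior described above:

```python
# Simplified illustration of the text-only case: every RoPE axis gets the
# same consecutive positions, offset by the tokens already generated.
def text_only_position_ids(start, num_tokens):
    row = list(range(start, start + num_tokens))
    return [row[:], row[:], row[:]]  # shape (3, num_tokens)
```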
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the Qwen2.5-VL model.
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the inputs for subsequent execution steps in multi-step generation.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
vision_model
vision_model: Model
The compiled vision model for processing images.
VisionConfig
class max.pipelines.architectures.qwen2_5vl.VisionConfig(dtype, llm_dtype, devices, patch_size, temporal_patch_size, in_channels, hidden_size, num_attention_heads, depth, intermediate_size, out_hidden_size, fullatt_block_indexes, rms_norm_eps, window_size, spatial_merge_size, quant_config=None)
Bases: object
Base configuration for Qwen2.5-VL vision models with required fields.
Parameters:
- dtype (DType)
- llm_dtype (DType)
- devices (list[DeviceRef])
- patch_size (int)
- temporal_patch_size (int)
- in_channels (int)
- hidden_size (int)
- num_attention_heads (int)
- depth (int)
- intermediate_size (int)
- out_hidden_size (int)
- fullatt_block_indexes (list[int])
- rms_norm_eps (float)
- window_size (int)
- spatial_merge_size (int)
- quant_config (QuantConfig | None)
depth
depth: int
Number of vision transformer layers.
devices
Devices that the Qwen2.5-VL vision encoder model is parallelized over.
dtype
dtype: DType
DType of the Qwen2.5-VL vision model weights.
finalize()
finalize(huggingface_config, vision_state_dict, vision_dtype, llm_dtype)
Finalize the VisionConfig with state_dict-dependent fields.
Parameters:
- huggingface_config (AutoConfig)
- vision_state_dict (dict[str, WeightData])
- vision_dtype (DType)
- llm_dtype (DType)
Return type:
None
fullatt_block_indexes
Indices of the full-attention blocks in the vision encoder.
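A small sketch of how such an index list can select the attention kind per vision layer, with every non-listed layer falling back to windowed attention (the helper is hypothetical, not this module's API):

```python
# Hypothetical helper: layers whose index appears in fullatt_block_indexes
# use full attention; all other vision layers use windowed attention.
def attention_kind_per_layer(depth, fullatt_block_indexes):
    full = set(fullatt_block_indexes)
    return ["full" if i in full else "window" for i in range(depth)]
```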
hidden_size
hidden_size: int
Hidden size of the vision encoder.
in_channels
in_channels: int
Number of input channels to the vision transformer.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, hf_vision_config)
Initialize a VisionConfig from the HuggingFace vision config.
Note: dtype fields will be set to defaults and should be updated via finalize() once the state_dict is available.
Parameters:
- pipeline_config (PipelineConfig)
- hf_vision_config (AutoConfig)
intermediate_size
intermediate_size: int
Intermediate size of the vision encoder's feed-forward layers.
llm_dtype
llm_dtype: DType
DType of the Qwen2.5-VL language model weights.
num_attention_heads
num_attention_heads: int
Number of attention heads in the vision encoder.
out_hidden_size
out_hidden_size: int
Output hidden size of the vision encoder. Also the hidden size of the language model.
patch_size
patch_size: int
Vision transformer patch size.
quant_config
quant_config: QuantConfig | None = None
Scaled quantization configuration for the vision encoder.
rms_norm_eps
rms_norm_eps: float
Epsilon for RMS normalization.
spatial_merge_size
spatial_merge_size: int
Spatial merge size for the vision encoder.
temporal_patch_size
temporal_patch_size: int
Vision transformer temporal patch size.
window_size
window_size: int
Window size for the vision encoder.