Python module

max.pipelines.architectures.internvl

InternVL vision-language architecture for multimodal text generation.

InternVLConfig​

class max.pipelines.architectures.internvl.InternVLConfig(*, devices, downsample_ratio, num_image_token, vision_config, llm_config)

Bases: ArchConfigWithKVCache

Configuration for InternVL models.

Parameters:

  • devices (list[DeviceRef]) – Devices that the InternVL model is parallelized over.
  • downsample_ratio (float) – Downsample ratio for vision features.
  • num_image_token (int) – Number of image tokens per patch.
  • vision_config (VisionConfig) – Vision encoder configuration.
  • llm_config (Llama3Config | Qwen3Config) – Language model configuration.

calculate_max_seq_len()​

static calculate_max_seq_len(pipeline_config, huggingface_config)

Calculate maximum sequence length for InternVL.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Return type:

int

construct_kv_params()​

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Parameters:

  • huggingface_config (AutoConfig)
  • pipeline_config (PipelineConfig)
  • devices (list[DeviceRef])
  • kv_cache_config (KVCacheConfig)
  • cache_dtype (DType)

Return type:

KVCacheParams

devices​

devices: list[DeviceRef]

Devices that the InternVL model is parallelized over.

downsample_ratio​

downsample_ratio: float

Downsample ratio for vision features.

finalize()​

finalize(huggingface_config, llm_state_dict, vision_state_dict, dtype, return_logits, norm_method='rms_norm')

Finalize the InternVLConfig instance with state_dict dependent fields.

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
  • vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
  • dtype (DType) – Data type for model parameters.
  • return_logits (ReturnLogits) – Return logits configuration.
  • norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.

Return type:

None

get_kv_params()​

get_kv_params()

Returns the KV cache parameters from the embedded LLM config.

Return type:

KVCacheParams

get_max_seq_len()​

get_max_seq_len()

Returns the maximum sequence length from the embedded LLM config.

Return type:

int

get_num_layers()​

static get_num_layers(huggingface_config)

Parameters:

huggingface_config (AutoConfig)

Return type:

int

initialize()​

classmethod initialize(pipeline_config, model_config=None)

Initializes an InternVLConfig instance from pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig)
  • model_config

Returns:

An InternVLConfig instance with fields initialized from config.

Return type:

Self

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config)

Initializes an InternVLConfig from pipeline and HuggingFace configs.

This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Returns:

An InternVLConfig instance ready for finalization.

Return type:

Self

llm_config​

llm_config: Llama3Config | Qwen3Config

Language model configuration (a Llama3Config, used for Qwen2-style language models, or a Qwen3Config).

num_image_token​

num_image_token: int

Number of image tokens per patch.
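
The relationship between downsample_ratio and num_image_token can be shown with a small worked example. The values image_size=448, patch_size=14, and downsample_ratio=0.5 are assumed here as typical InternVL defaults for illustration:

```python
# Hypothetical helper: derives the number of image tokens per tile from the
# vision config. Assumed defaults: 448px images, 14px patches, 0.5 downsample.
def num_image_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    patches_per_side = image_size // patch_size      # 448 // 14 = 32
    side = int(patches_per_side * downsample_ratio)  # pixel-shuffle halves each side: 16
    return side * side                               # 16 * 16 = 256

print(num_image_tokens(448, 14, 0.5))  # → 256
```

This matches the num_image_tokens=256 figure used in the activation-memory estimate below.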

vision_config​

vision_config: VisionConfig

Vision encoder configuration.

InternVLInputs​

class max.pipelines.architectures.internvl.InternVLInputs(tokens, input_row_offsets, signal_buffers, return_n_logits, pixel_values=None, image_token_indices=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

A class representing inputs for the InternVL model.

has_vision_inputs​

property has_vision_inputs: bool

Check if this input contains vision data.

image_token_indices​

image_token_indices: list[Buffer] | None = None

Per-device pre-computed indices of image tokens in the input sequence.

input_row_offsets​

input_row_offsets: list[Buffer]

Per-device tensors containing the offsets for each row in the ragged input sequence.
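
As an illustration of the ragged layout (not the real MAX buffer types): variable-length sequences are concatenated into one flat token buffer, and the row-offsets tensor records where each sequence starts, plus the total length at the end.

```python
import numpy as np

# Two sequences of lengths 3 and 5, packed without padding.
seqs = [[101, 7, 9], [101, 4, 4, 8, 2]]
tokens = np.concatenate([np.asarray(s, dtype=np.int64) for s in seqs])
# Offsets: row i spans tokens[offsets[i]:offsets[i + 1]].
input_row_offsets = np.cumsum([0] + [len(s) for s in seqs]).astype(np.uint32)

print(tokens.tolist())             # → [101, 7, 9, 101, 4, 4, 8, 2]
print(input_row_offsets.tolist())  # → [0, 3, 8]
```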

pixel_values​

pixel_values: list[Buffer] | None = None

Pixel values for vision inputs.

return_n_logits​

return_n_logits: Buffer

Number of logits to return; used, for example, by speculative decoding.

signal_buffers​

signal_buffers: list[Buffer]

Device buffers used for synchronization in communication collectives.

tokens​

tokens: Buffer

Tensor containing the input token IDs.

InternVLModel​

class max.pipelines.architectures.internvl.InternVLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]

An InternVL pipeline model for multimodal text generation.

calculate_max_seq_len()​

static calculate_max_seq_len(pipeline_config, huggingface_config)

Calculates the maximum sequence length for the InternVL model.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace model configuration

Return type:

int

estimate_activation_memory()​

classmethod estimate_activation_memory(pipeline_config, huggingface_config)

Estimates the activation memory required for InternVL model execution.

This accounts for the temporary memory buffers used during model execution, particularly for the vision encoder and language model activations.

Based on empirical analysis of MGP buffer plans (GEX-2365):

  • Vision encoder uses ~128MiB per image.
  • Language model uses ~100KB per token for intermediate activations.

These values come from printing the high-water mark of the mgp.buffer.plan op and verifying against GPU free memory at runtime.

The vision encoder memory scales with the number of images that can be processed concurrently, which is limited by max_batch_input_tokens / num_image_tokens where num_image_tokens=256 for InternVL.

TODO(GEX-2365): Replace this with a more general solution that analyzes the compiled graph’s memory requirements directly.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace model configuration

Returns:

Estimated activation memory in bytes

Return type:

int
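
The estimate described above can be reproduced with a short arithmetic sketch. The constants come from the docstring; the function name and the example value of max_batch_input_tokens are hypothetical:

```python
MIB = 1024 * 1024
VISION_BYTES_PER_IMAGE = 128 * MIB  # ~128 MiB per image (empirical, GEX-2365)
LM_BYTES_PER_TOKEN = 100 * 1024     # ~100 KB per token of intermediate activations
NUM_IMAGE_TOKENS = 256              # image tokens per image for InternVL

def estimate_activation_memory(max_batch_input_tokens: int) -> int:
    # Concurrent images are bounded by how many 256-token image blocks
    # fit in the batch's token budget.
    max_images = max_batch_input_tokens // NUM_IMAGE_TOKENS
    return (max_images * VISION_BYTES_PER_IMAGE
            + max_batch_input_tokens * LM_BYTES_PER_TOKEN)

# e.g. an 8192-token batch budget admits 32 concurrent images:
print(estimate_activation_memory(8192))  # → 5133828096 (~4.8 GiB)
```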

execute()​

execute(model_inputs)

Executes the InternVL model with the prepared inputs.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

get_kv_params()​

classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Gets the parameters required to configure the KV cache for InternVL.

Parameters:

  • huggingface_config (AutoConfig)
  • pipeline_config (PipelineConfig)
  • devices (list[DeviceRef])
  • kv_cache_config (KVCacheConfig)
  • cache_dtype (DType)

Return type:

KVCacheParams

language_model​

language_model: Model

The compiled language model for text generation.

load_model()​

load_model(session)

Loads the compiled InternVL models into the MAX Engine session.

Returns:

A tuple of (vision_model, language_model).

Parameters:

session (InferenceSession)

Return type:

tuple[Model, Model]

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepares the initial inputs for the first execution pass of the InternVL model.

Return type:

ModelInputs

prepare_next_token_inputs()​

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepares the inputs for subsequent execution steps in a multi-step generation.

Return type:

ModelInputs
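
The prepare/execute call sequence for multi-step generation can be sketched as follows. ToyModel, its method bodies, and its return values are stand-ins that only illustrate the protocol, not the real InternVLModel API:

```python
class ToyModel:
    def prepare_initial_token_inputs(self, batch):
        # First pass: pack the full prompt.
        return {"tokens": list(batch), "step": 0}

    def execute(self, inputs):
        # Stand-in "model": emits the current sequence length as the next token.
        return len(inputs["tokens"])

    def prepare_next_token_inputs(self, next_token, prev_inputs):
        # Subsequent passes: append the newly sampled token.
        return {"tokens": prev_inputs["tokens"] + [next_token],
                "step": prev_inputs["step"] + 1}

model = ToyModel()
inputs = model.prepare_initial_token_inputs([101, 7, 9])
generated = []
for _ in range(3):
    next_token = model.execute(inputs)
    generated.append(next_token)
    inputs = model.prepare_next_token_inputs(next_token, inputs)

print(generated)  # → [3, 4, 5]
```

The key point is that prepare_next_token_inputs reuses the previous step's inputs, so per-step preparation stays cheap relative to the initial pass.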

vision_model​

vision_model: Model

The compiled vision model for processing images.

VisionConfig​

class max.pipelines.architectures.internvl.VisionConfig(hidden_size, intermediate_size, norm_type, image_size, patch_size, num_attention_heads, head_dim, layer_norm_eps, qk_normalization, qkv_bias, num_hidden_layers, dtype=bfloat16, o_proj_bias=False)

Bases: object

Configuration for the InternVL vision encoder, with required fields.

Parameters:

  • hidden_size (int)
  • intermediate_size (int)
  • norm_type (Literal['rms_norm', 'layer_norm'])
  • image_size (int)
  • patch_size (int)
  • num_attention_heads (int)
  • head_dim (int)
  • layer_norm_eps (float)
  • qk_normalization (bool)
  • qkv_bias (bool)
  • num_hidden_layers (int)
  • dtype (DType)
  • o_proj_bias (bool)

dtype​

dtype: DType = bfloat16

DType of the InternVL vision model weights.

finalize()​

finalize(dtype, state_dict)

Finalize VisionConfig with state_dict dependent fields.

Parameters:

  • dtype (DType)
  • state_dict (dict[str, WeightData])

Return type:

None

head_dim​

head_dim: int

Dimension of each attention head.

hidden_size​

hidden_size: int

Hidden size of the vision encoder.

image_size​

image_size: int

Input image size.

initialize_from_config()​

classmethod initialize_from_config(hf_vision_config)

Initialize VisionConfig from HuggingFace vision config.

Note: dtype and o_proj_bias fields will be set to defaults and should be updated via finalize() once state_dict is available.

Parameters:

hf_vision_config (AutoConfig)

Return type:

VisionConfig

intermediate_size​

intermediate_size: int

Intermediate size in the vision encoder’s feed-forward layers.

layer_norm_eps​

layer_norm_eps: float

Epsilon for layer normalization.

norm_type​

norm_type: Literal['rms_norm', 'layer_norm']

Type of normalization used in the vision encoder.

num_attention_heads​

num_attention_heads: int

Number of attention heads in the vision encoder.

num_hidden_layers​

num_hidden_layers: int

Number of hidden layers in the vision encoder.

o_proj_bias​

o_proj_bias: bool = False

Whether to use bias in the out projection.

patch_size​

patch_size: int

Vision transformer patch size.

qk_normalization​

qk_normalization: bool

Whether to use QK normalization in attention.

qkv_bias​

qkv_bias: bool

Whether to use bias in the QKV projection. Default: False.