Python module

max.pipelines.architectures.internvl

InternVL vision-language architecture for multimodal text generation.

InternVLConfig​

class max.pipelines.architectures.internvl.InternVLConfig(*, devices, downsample_ratio, num_image_token, vision_config, llm_config)

Bases: ArchConfigWithKVCache

Configuration for InternVL models.

Parameters:

  • devices (list[DeviceRef]) – Devices that the InternVL model is parallelized over.
  • downsample_ratio (float) – Downsample ratio for vision features.
  • num_image_token (int) – Number of image tokens per patch.
  • vision_config (VisionConfig) – Vision encoder configuration.
  • llm_config (Llama3Config | Qwen3Config) – Language model configuration.

calculate_max_seq_len()​

static calculate_max_seq_len(pipeline_config, huggingface_config)

Calculate maximum sequence length for InternVL.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Return type:

int

construct_kv_params()​

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Parameters:

  • huggingface_config (AutoConfig)
  • pipeline_config (PipelineConfig)
  • devices (list[DeviceRef])
  • kv_cache_config (KVCacheConfig)
  • cache_dtype (DType)

Return type:

KVCacheParams

devices​

devices: list[DeviceRef]

Devices that the InternVL model is parallelized over.

downsample_ratio​

downsample_ratio: float

Downsample ratio for vision features.

finalize()​

finalize(huggingface_config, llm_state_dict, vision_state_dict, dtype, return_logits, norm_method='rms_norm')

Finalize the InternVLConfig instance with state_dict dependent fields.

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
  • vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
  • dtype (DType) – Data type for model parameters.
  • return_logits (ReturnLogits) – Return logits configuration.
  • norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.

Return type:

None

get_kv_params()​

get_kv_params()

Returns the KV cache parameters from the embedded LLM config.

Return type:

KVCacheParams

get_max_seq_len()​

get_max_seq_len()

Returns the maximum sequence length from the embedded LLM config.

Return type:

int

get_num_layers()​

static get_num_layers(huggingface_config)

Parameters:

huggingface_config (AutoConfig)

Return type:

int

initialize()​

classmethod initialize(pipeline_config, model_config=None)

Initializes an InternVLConfig instance from pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig)
  • model_config

Returns:

An InternVLConfig instance with fields initialized from config.

Return type:

Self

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config)

Initializes an InternVLConfig from pipeline and HuggingFace configs.

This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Returns:

An InternVLConfig instance ready for finalization.

Return type:

Self

llm_config​

llm_config: Llama3Config | Qwen3Config

Language model configuration (a Llama3Config, used for Qwen2-style language models, or a Qwen3Config).

num_image_token​

num_image_token: int

Number of image tokens per patch.
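
The relationship between downsample_ratio and num_image_token can be shown with a small worked example. The values image_size=448, patch_size=14, and downsample_ratio=0.5 are assumed here as typical InternVL defaults for illustration:

```python
# Hypothetical helper: derives the number of image tokens per tile from the
# vision config. Assumed defaults: 448px images, 14px patches, 0.5 downsample.
def num_image_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    patches_per_side = image_size // patch_size      # 448 // 14 = 32
    side = int(patches_per_side * downsample_ratio)  # pixel-shuffle halves each side: 16
    return side * side                               # 16 * 16 = 256

print(num_image_tokens(448, 14, 0.5))  # → 256
```

This matches the num_image_tokens=256 figure used in the activation-memory estimate below.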

vision_config​

vision_config: VisionConfig

Vision encoder configuration.

InternVLInputs​

class max.pipelines.architectures.internvl.InternVLInputs(tokens, input_row_offsets, signal_buffers, return_n_logits, pixel_values=None, image_token_indices=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

A class representing inputs for the InternVL model.

has_vision_inputs​

property has_vision_inputs: bool

Check if this input contains vision data.

image_token_indices​

image_token_indices: list[Buffer] | None = None

Per-device pre-computed indices of image tokens in the input sequence.

input_row_offsets​

input_row_offsets: list[Buffer]

Per-device tensors containing the offsets for each row in the ragged input sequence.
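
As an illustration of the ragged layout (not the real MAX buffer types): variable-length sequences are concatenated into one flat token buffer, and the row-offsets tensor records where each sequence starts, plus the total length at the end.

```python
import numpy as np

# Two sequences of lengths 3 and 5, packed without padding.
seqs = [[101, 7, 9], [101, 4, 4, 8, 2]]
tokens = np.concatenate([np.asarray(s, dtype=np.int64) for s in seqs])
# Offsets: row i spans tokens[offsets[i]:offsets[i + 1]].
input_row_offsets = np.cumsum([0] + [len(s) for s in seqs]).astype(np.uint32)

print(tokens.tolist())             # → [101, 7, 9, 101, 4, 4, 8, 2]
print(input_row_offsets.tolist())  # → [0, 3, 8]
```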

pixel_values​

pixel_values: list[Buffer] | None = None

Pixel values for vision inputs.

return_n_logits​

return_n_logits: Buffer

Number of logits to return; used, for example, by speculative decoding.

signal_buffers​

signal_buffers: list[Buffer]

Device buffers used for synchronization in communication collectives.

tokens​

tokens: Buffer

Tensor containing the input token IDs.

InternVLModel​

class max.pipelines.architectures.internvl.InternVLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]

An InternVL pipeline model for multimodal text generation.

calculate_max_seq_len()​

static calculate_max_seq_len(pipeline_config, huggingface_config)

Calculates the maximum sequence length for the InternVL model.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace model configuration

Return type:

int

estimate_activation_memory()​

classmethod estimate_activation_memory(pipeline_config, huggingface_config)

Estimates the activation memory required for InternVL model execution.

This accounts for the temporary memory buffers used during model execution, particularly for the vision encoder and language model activations.

Based on empirical analysis of MGP buffer plans (GEX-2365):

  • Vision encoder uses ~128MiB per image.
  • Language model uses ~100KB per token for intermediate activations.

These values come from printing the high-water mark of the mgp.buffer.plan op and verifying against GPU free memory at runtime.

The vision encoder memory scales with the number of images that can be processed concurrently, which is limited by max_batch_input_tokens / num_image_tokens where num_image_tokens=256 for InternVL.

TODO(GEX-2365): Replace this with a more general solution that analyzes the compiled graph’s memory requirements directly.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace model configuration

Returns:

Estimated activation memory in bytes

Return type:

int
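
The estimate described above can be reproduced with a short arithmetic sketch. The constants come from the docstring; the function name and the example value of max_batch_input_tokens are hypothetical:

```python
MIB = 1024 * 1024
VISION_BYTES_PER_IMAGE = 128 * MIB  # ~128 MiB per image (empirical, GEX-2365)
LM_BYTES_PER_TOKEN = 100 * 1024     # ~100 KB per token of intermediate activations
NUM_IMAGE_TOKENS = 256              # image tokens per image for InternVL

def estimate_activation_memory(max_batch_input_tokens: int) -> int:
    # Concurrent images are bounded by how many 256-token image blocks
    # fit in the batch's token budget.
    max_images = max_batch_input_tokens // NUM_IMAGE_TOKENS
    return (max_images * VISION_BYTES_PER_IMAGE
            + max_batch_input_tokens * LM_BYTES_PER_TOKEN)

# e.g. an 8192-token batch budget admits 32 concurrent images:
print(estimate_activation_memory(8192))  # → 5133828096 (~4.8 GiB)
```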

execute()​

execute(model_inputs)

Executes the InternVL model with the prepared inputs.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

get_kv_params()​

classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Gets the parameters required to configure the KV cache for InternVL.

Parameters:

  • huggingface_config (AutoConfig)
  • pipeline_config (PipelineConfig)
  • devices (list[DeviceRef])
  • kv_cache_config (KVCacheConfig)
  • cache_dtype (DType)

Return type:

KVCacheParams

language_model​

language_model: Model

The compiled language model for text generation.

load_model()​

load_model(session)

Loads the compiled InternVL models into the MAX Engine session.

Returns:

A tuple of (vision_model, language_model).

Parameters:

session (InferenceSession)

Return type:

tuple[Model, Model]

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepares the initial inputs for the first execution pass of the InternVL model.

Return type:

ModelInputs

prepare_next_token_inputs()​

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepares the inputs for subsequent execution steps in a multi-step generation.

Return type:

ModelInputs
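
The prepare/execute call sequence for multi-step generation can be sketched as follows. ToyModel, its method bodies, and its return values are stand-ins that only illustrate the protocol, not the real InternVLModel API:

```python
class ToyModel:
    def prepare_initial_token_inputs(self, batch):
        # First pass: pack the full prompt.
        return {"tokens": list(batch), "step": 0}

    def execute(self, inputs):
        # Stand-in "model": emits the current sequence length as the next token.
        return len(inputs["tokens"])

    def prepare_next_token_inputs(self, next_token, prev_inputs):
        # Subsequent passes: append the newly sampled token.
        return {"tokens": prev_inputs["tokens"] + [next_token],
                "step": prev_inputs["step"] + 1}

model = ToyModel()
inputs = model.prepare_initial_token_inputs([101, 7, 9])
generated = []
for _ in range(3):
    next_token = model.execute(inputs)
    generated.append(next_token)
    inputs = model.prepare_next_token_inputs(next_token, inputs)

print(generated)  # → [3, 4, 5]
```

The key point is that prepare_next_token_inputs reuses the previous step's inputs, so per-step preparation stays cheap relative to the initial pass.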

vision_model​

vision_model: Model

The compiled vision model for processing images.

VisionConfig​

class max.pipelines.architectures.internvl.VisionConfig(hidden_size, intermediate_size, norm_type, image_size, patch_size, num_attention_heads, head_dim, layer_norm_eps, qk_normalization, qkv_bias, num_hidden_layers, dtype=bfloat16, o_proj_bias=False)

Bases: object

Configuration for the InternVL vision encoder, with required fields.

Parameters:

  • hidden_size (int)
  • intermediate_size (int)
  • norm_type (Literal['rms_norm', 'layer_norm'])
  • image_size (int)
  • patch_size (int)
  • num_attention_heads (int)
  • head_dim (int)
  • layer_norm_eps (float)
  • qk_normalization (bool)
  • qkv_bias (bool)
  • num_hidden_layers (int)
  • dtype (DType)
  • o_proj_bias (bool)

dtype​

dtype: DType = bfloat16

DType of the InternVL vision model weights.

finalize()​

finalize(dtype, state_dict)

Finalize VisionConfig with state_dict dependent fields.

Parameters:

  • dtype (DType)
  • state_dict (dict[str, WeightData])

Return type:

None

head_dim​

head_dim: int

Dimension of each attention head.

hidden_size​

hidden_size: int

Hidden size of the vision encoder.

image_size​

image_size: int

Input image size.

initialize_from_config()​

classmethod initialize_from_config(hf_vision_config)

Initialize VisionConfig from HuggingFace vision config.

Note: dtype and o_proj_bias fields will be set to defaults and should be updated via finalize() once state_dict is available.

Parameters:

hf_vision_config (AutoConfig)

Return type:

VisionConfig

intermediate_size​

intermediate_size: int

Intermediate size in the vision encoder’s feed-forward layers.

layer_norm_eps​

layer_norm_eps: float

Epsilon for layer normalization.

norm_type​

norm_type: Literal['rms_norm', 'layer_norm']

Type of normalization used in the vision encoder.

num_attention_heads​

num_attention_heads: int

Number of attention heads in the vision encoder.

num_hidden_layers​

num_hidden_layers: int

Number of hidden layers in the vision encoder.

o_proj_bias​

o_proj_bias: bool = False

Whether to use bias in the out projection.

patch_size​

patch_size: int

Vision transformer patch size.

qk_normalization​

qk_normalization: bool

Whether to use QK normalization in attention.

qkv_bias​

qkv_bias: bool

Whether to use bias in the QKV projection. Default: False.