Python module

max.pipelines.architectures.internvl

InternVL vision-language architecture for multimodal text generation.

InternVLConfig

class max.pipelines.architectures.internvl.InternVLConfig(*, devices, downsample_ratio, num_image_token, vision_config, llm_config)

Bases: ArchConfigWithKVCache

Configuration for InternVL models.

Parameters:

  • devices (list[DeviceRef]) – Devices that the InternVL model is parallelized over.
  • downsample_ratio (float) – Downsample ratio for vision features.
  • num_image_token (int) – Number of image tokens per patch.
  • vision_config (VisionConfig) – Vision encoder configuration.
  • llm_config (Llama3Config | Qwen3Config) – Language model configuration.

calculate_max_seq_len()

static calculate_max_seq_len(pipeline_config, huggingface_config)

Calculate maximum sequence length for InternVL.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Return type:

int

construct_kv_params()

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • devices (list[DeviceRef]) – Devices the model is parallelized over.
  • kv_cache_config (KVCacheConfig) – KV cache configuration.
  • cache_dtype (DType) – Data type for the KV cache.

Return type:

KVCacheParams

devices

devices: list[DeviceRef]

Devices that the InternVL model is parallelized over.

downsample_ratio

downsample_ratio: float

Downsample ratio for vision features.

finalize()

finalize(huggingface_config, llm_state_dict, vision_state_dict, dtype, return_logits, norm_method='rms_norm')

Finalize the InternVLConfig instance with state_dict dependent fields.

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
  • vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
  • dtype (DType) – Data type for model parameters.
  • return_logits (ReturnLogits) – Return logits configuration.
  • norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.

Return type:

None

get_kv_params()

get_kv_params()

Returns the KV cache parameters from the embedded LLM config.

Return type:

KVCacheParams

get_max_seq_len()

get_max_seq_len()

Returns the maximum sequence length from the embedded LLM config.

Return type:

int

get_num_layers()

static get_num_layers(huggingface_config)

Parameters:

huggingface_config (AutoConfig)

Return type:

int

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initializes an InternVLConfig instance from pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • model_config – Optional existing model configuration to initialize from.

Returns:

An InternVLConfig instance with fields initialized from config.

Return type:

Self

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config)

Initializes an InternVLConfig from pipeline and HuggingFace configs.

This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Returns:

An InternVLConfig instance ready for finalization.

Return type:

Self

llm_config

llm_config: Llama3Config | Qwen3Config

Language model configuration (Llama3Config for Qwen2-style language models, or Qwen3Config).

num_image_token

num_image_token: int

Number of image tokens per patch.
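
For intuition on how downsample_ratio and num_image_token relate: with the 448px tiles and 14px ViT patches typical of InternVL (assumed here; this page does not state them), a 0.5 downsample ratio yields the 256 image tokens per patch cited in estimate_activation_memory:

```python
# Assumed InternVL-style defaults, for illustration only.
image_size = 448       # input tile resolution in pixels
patch_size = 14        # ViT patch size in pixels
downsample_ratio = 0.5

patches_per_side = image_size // patch_size                  # 32 patches per side
downsampled_side = int(patches_per_side * downsample_ratio)  # 16 after pixel shuffle
num_image_token = downsampled_side ** 2                      # 256 tokens per patch
```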

vision_config

vision_config: VisionConfig

Vision encoder configuration.

InternVLModel

class max.pipelines.architectures.internvl.InternVLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]

An InternVL pipeline model for multimodal text generation.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • session (InferenceSession) – The MAX Engine inference session.
  • devices – Devices to run the model on.
  • kv_cache_config (KVCacheConfig) – KV cache configuration.
  • weights – Model weights.
  • adapter – Optional weights format adapter.
  • return_logits (ReturnLogits) – Return logits configuration.

calculate_max_seq_len()

static calculate_max_seq_len(pipeline_config, huggingface_config)

Calculates the maximum sequence length for the InternVL model.

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – HuggingFace model configuration.

Return type:

int

estimate_activation_memory()

classmethod estimate_activation_memory(pipeline_config, huggingface_config)

Estimates the activation memory required for InternVL model execution.

This accounts for the temporary memory buffers used during model execution, particularly for the vision encoder and language model activations.

Based on empirical analysis of MGP buffer plans (GEX-2365):

  • Vision encoder uses ~128MiB per image.
  • Language model uses ~100KB per token for intermediate activations.

These values come from printing the high water mark from the mgp.buffer.plan op, and verifying with GPU free memory at runtime.

The vision encoder memory scales with the number of images that can be processed concurrently, which is limited by max_batch_input_tokens / num_image_tokens where num_image_tokens=256 for InternVL.

TODO(GEX-2365): Replace this with a more general solution that analyzes the compiled graph’s memory requirements directly.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – HuggingFace model configuration

Returns:

Estimated activation memory in bytes

Return type:

int
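
The heuristic above can be re-derived as a standalone sketch. The 128 MiB-per-image and ~100 KB-per-token constants and the max_batch_input_tokens limit come from the description; this function is illustrative, not MAX's implementation:

```python
VISION_BYTES_PER_IMAGE = 128 * 1024 * 1024  # ~128 MiB per image (empirical)
LM_BYTES_PER_TOKEN = 100 * 1024             # ~100 KB per token (empirical)
NUM_IMAGE_TOKENS = 256                      # image tokens per patch for InternVL


def estimate_activation_memory(max_batch_input_tokens: int) -> int:
    """Estimated activation bytes for one batch of input tokens."""
    # Vision memory scales with how many images can be in flight at once,
    # bounded by the token budget divided by tokens per image.
    max_concurrent_images = max_batch_input_tokens // NUM_IMAGE_TOKENS
    vision_bytes = max_concurrent_images * VISION_BYTES_PER_IMAGE
    # Language-model intermediates scale with the token budget itself.
    lm_bytes = max_batch_input_tokens * LM_BYTES_PER_TOKEN
    return vision_bytes + lm_bytes
```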

execute()

execute(model_inputs)

Executes the InternVL model with the prepared inputs.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

get_kv_params()

classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Gets the parameters required to configure the KV cache for InternVL.

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • devices (list[DeviceRef]) – Devices the model is parallelized over.
  • kv_cache_config (KVCacheConfig) – KV cache configuration.
  • cache_dtype (DType) – Data type for the KV cache.

Return type:

KVCacheParams

language_model

language_model: Model

The compiled language model for text generation.

load_model()

load_model(session)

Loads the compiled InternVL models into the MAX Engine session.

Parameters:

session (InferenceSession)

Returns:

A tuple of (vision_model, language_model).

Return type:

tuple[Model, Model]

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepares the initial inputs for the first execution pass of the InternVL model.

Parameters:

  • replica_batches – Batches of TextAndVisionContext, one batch per model replica.
  • kv_cache_inputs – Optional KV cache inputs for the batch.
  • return_n_logits (int) – Number of logits to return.

Return type:

ModelInputs

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepares the inputs for subsequent execution steps in a multi-step generation.

Parameters:

  • next_tokens – Tokens sampled from the previous execution step.
  • prev_model_inputs – Model inputs from the previous execution step.

Return type:

ModelInputs
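
Taken together, prepare_initial_token_inputs, execute, and prepare_next_token_inputs form a multi-step generation loop. The sketch below exercises that loop shape with a stub; the stub's behavior and its dict-based inputs are stand-ins, not real MAX objects:

```python
class StubModel:
    """Minimal stand-in so the loop shape can run without MAX Engine."""

    def prepare_initial_token_inputs(self, replica_batches):
        return {"tokens": replica_batches, "step": 0}

    def execute(self, model_inputs):
        # Pretend "generation" just increments each token id.
        return {"next_tokens": [t + 1 for t in model_inputs["tokens"]]}

    def prepare_next_token_inputs(self, next_tokens, prev_model_inputs):
        return {"tokens": next_tokens, "step": prev_model_inputs["step"] + 1}


def generate(model, replica_batches, num_steps):
    # The first pass uses the full (possibly multimodal) prompt inputs...
    inputs = model.prepare_initial_token_inputs(replica_batches)
    outputs = None
    for _ in range(num_steps):
        outputs = model.execute(inputs)
        # ...subsequent passes only feed back the newly produced tokens.
        inputs = model.prepare_next_token_inputs(outputs["next_tokens"], inputs)
    return outputs


result = generate(StubModel(), [1, 2, 3], num_steps=3)
```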

vision_model

vision_model: Model

The compiled vision model for processing images.