Python module
max.pipelines.architectures.internvl
InternVL vision-language architecture for multimodal text generation.
InternVLConfig
class max.pipelines.architectures.internvl.InternVLConfig(*, devices, downsample_ratio, num_image_token, vision_config, llm_config)
Bases: ArchConfigWithKVCache
Configuration for InternVL models.
Parameters:
- devices (list[DeviceRef])
- downsample_ratio (float)
- num_image_token (int)
- vision_config (VisionConfig)
- llm_config (Llama3Config | Qwen3Config)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculate the maximum sequence length for InternVL.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
Return type:
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
devices
devices: list[DeviceRef]
Devices that the InternVL model is parallelized over.
downsample_ratio
downsample_ratio: float
Downsample ratio for vision features.
finalize()
finalize(huggingface_config, llm_state_dict, vision_state_dict, dtype, return_logits, norm_method='rms_norm')
Finalize the InternVLConfig instance with state_dict-dependent fields.
Parameters:
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
- vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
- dtype (DType) – Data type for model parameters.
- return_logits (ReturnLogits) – Return logits configuration.
- norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.
Return type:
None
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
Return type:
get_num_layers()
static get_num_layers(huggingface_config)
Parameters:
- huggingface_config (AutoConfig)
Return type:
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes an InternVLConfig instance from pipeline configuration.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
An InternVLConfig instance with fields initialized from config.
Return type:
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config)
Initializes an InternVLConfig from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
Returns:
An InternVLConfig instance ready for finalization.
Return type:
llm_config
llm_config: Llama3Config | Qwen3Config
Language model configuration (Llama3Config for Qwen2-style models, or Qwen3Config).
num_image_token
num_image_token: int
Number of image tokens per patch.
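The interplay of patch grid, pixel-shuffle downsampling, and image token count can be sketched in plain Python. This is an illustrative formula, not the library's code; the 448 px tile size, 14 px patch size, and 0.5 downsample ratio are typical InternVL values assumed here, and they reproduce the 256 image tokens mentioned in estimate_activation_memory() below:

```python
def num_image_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    # Patches per side, squared, gives the raw patch count; pixel-shuffle
    # downsampling reduces it in both spatial dimensions, hence the squared ratio.
    patches = (image_size // patch_size) ** 2
    return int(patches * downsample_ratio**2)

# Typical InternVL settings: 448px tiles, 14px patches, 0.5 downsample ratio.
print(num_image_tokens(448, 14, 0.5))  # 256
```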
vision_config
vision_config: VisionConfig
Vision encoder configuration.
InternVLInputs
class max.pipelines.architectures.internvl.InternVLInputs(tokens, input_row_offsets, signal_buffers, return_n_logits, pixel_values=None, image_token_indices=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)
Bases: ModelInputs
A class representing inputs for the InternVL model.
Parameters:
- tokens (Buffer)
- input_row_offsets (list[Buffer])
- signal_buffers (list[Buffer])
- return_n_logits (Buffer)
- pixel_values (list[Buffer] | None)
- image_token_indices (list[Buffer] | None)
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
has_vision_inputs
property has_vision_inputs: bool
Check if this input contains vision data.
image_token_indices
Per-device pre-computed indices of image tokens in the input sequence.
input_row_offsets
Per-device tensors containing the offsets for each row in the ragged input sequence.
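As an illustration of the ragged layout these offsets describe, here is a minimal sketch with plain Python lists standing in for device buffers (the helper name and values are hypothetical):

```python
def row_offsets(batch: list[list[int]]) -> list[int]:
    # offsets[i] marks where sequence i begins in the flattened token
    # buffer; offsets[-1] is the total token count, so there are
    # len(batch) + 1 entries.
    offsets = [0]
    for seq in batch:
        offsets.append(offsets[-1] + len(seq))
    return offsets

# Three sequences of lengths 3, 1, and 2 flatten into one buffer:
batch = [[101, 7, 9], [42], [5, 6]]
flat = [tok for seq in batch for tok in seq]
print(row_offsets(batch))  # [0, 3, 4, 6]
print(flat)                # [101, 7, 9, 42, 5, 6]
```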
pixel_values
Pixel values for vision inputs.
return_n_logits
return_n_logits: Buffer
Number of logits to return; used, for example, by speculative decoding.
signal_buffers
Device buffers used for synchronization in communication collectives.
tokens
tokens: Buffer
Tensor containing the input token IDs.
InternVLModel
class max.pipelines.architectures.internvl.InternVLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]
An InternVL pipeline model for multimodal text generation.
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the InternVL model.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
Return type:
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for InternVL model execution.
This accounts for the temporary memory buffers used during model execution, particularly for the vision encoder and language model activations.
Based on empirical analysis of MGP buffer plans (GEX-2365):
- The vision encoder uses ~128 MiB per image.
- The language model uses ~100 KB per token for intermediate activations.
These values come from printing the high-water mark of the mgp.buffer.plan op and verifying against free GPU memory at runtime.
The vision encoder memory scales with the number of images that can be processed concurrently, which is limited by max_batch_input_tokens / num_image_tokens, where num_image_tokens=256 for InternVL.
TODO(GEX-2365): Replace this with a more general solution that analyzes the compiled graph's memory requirements directly.
Parameters:
- pipeline_config (PipelineConfig) – Pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
Returns:
Estimated activation memory in bytes.
Return type:
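The heuristic above reduces to simple arithmetic. A minimal sketch, assuming only the constants stated in the docstring (~128 MiB per image, ~100 KB per token, 256 image tokens); the function signature is illustrative, not the actual classmethod:

```python
def estimate_activation_bytes(max_batch_input_tokens: int,
                              num_image_tokens: int = 256) -> int:
    """Illustrative sketch of the documented heuristic, in bytes."""
    VISION_BYTES_PER_IMAGE = 128 * 1024 * 1024  # ~128 MiB per image
    LLM_BYTES_PER_TOKEN = 100 * 1024            # ~100 KB per token
    # Concurrent images are bounded by how many fit in the token budget.
    max_concurrent_images = max_batch_input_tokens // num_image_tokens
    vision = max_concurrent_images * VISION_BYTES_PER_IMAGE
    language = max_batch_input_tokens * LLM_BYTES_PER_TOKEN
    return vision + language

# e.g. an 8192-token batch budget allows up to 32 concurrent images:
print(estimate_activation_bytes(8192))
```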
execute()
execute(model_inputs)
Executes the InternVL model with the prepared inputs.
Parameters:
- model_inputs (ModelInputs)
Return type:
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Gets the parameters required to configure the KV cache for InternVL.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Loads the compiled InternVL models into the MAX Engine session.
Parameters:
- session (InferenceSession)
Returns:
A tuple of (vision_model, language_model).
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the InternVL model.
Parameters:
- replica_batches (Sequence[Sequence[TextAndVisionContext]])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- return_n_logits (int)
Return type:
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the inputs for subsequent execution steps in multi-step generation.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
Return type:
vision_model
vision_model: Model
The compiled vision model for processing images.
VisionConfig
class max.pipelines.architectures.internvl.VisionConfig(hidden_size, intermediate_size, norm_type, image_size, patch_size, num_attention_heads, head_dim, layer_norm_eps, qk_normalization, qkv_bias, num_hidden_layers, dtype=bfloat16, o_proj_bias=False)
Bases: object
Base configuration for InternVL models with required fields.
Parameters:
- hidden_size (int)
- intermediate_size (int)
- norm_type (Literal['rms_norm', 'layer_norm'])
- image_size (int)
- patch_size (int)
- num_attention_heads (int)
- head_dim (int)
- layer_norm_eps (float)
- qk_normalization (bool)
- qkv_bias (bool)
- num_hidden_layers (int)
- dtype (DType)
- o_proj_bias (bool)
dtype
dtype: DType = DType.bfloat16
DType of the InternVL vision model weights.
finalize()
finalize(dtype, state_dict)
Finalize the VisionConfig with state_dict-dependent fields.
Parameters:
- dtype (DType)
- state_dict (dict[str, WeightData])
Return type:
None
head_dim
head_dim: int
Dimension of each attention head.
hidden_size
hidden_size: int
Hidden size of the vision encoder.
image_size
image_size: int
Input image size.
initialize_from_config()
classmethod initialize_from_config(hf_vision_config)
Initialize a VisionConfig from the HuggingFace vision config.
Note: the dtype and o_proj_bias fields are set to defaults and should be updated via finalize() once the state_dict is available.
Parameters:
- hf_vision_config (AutoConfig)
Return type:
intermediate_size
intermediate_size: int
Intermediate size in the vision encoder's feed-forward layers.
layer_norm_eps
layer_norm_eps: float
Epsilon for layer normalization.
norm_type
norm_type: Literal['rms_norm', 'layer_norm']
Type of normalization used in the vision encoder.
num_attention_heads
num_attention_heads: int
Number of attention heads in the vision encoder.
num_hidden_layers
num_hidden_layers: int
Number of hidden layers in the vision encoder.
o_proj_bias
o_proj_bias: bool = False
Whether to use bias in the out projection.
patch_size
patch_size: int
Vision transformer patch size.
qk_normalization
qk_normalization: bool
Whether to use QK normalization in attention.
qkv_bias
qkv_bias: bool
Whether to use bias in the QKV projection. Default: False.