Python module
max.pipelines.architectures.internvl
InternVL vision-language architecture for multimodal text generation.
InternVLConfig
class max.pipelines.architectures.internvl.InternVLConfig(*, devices, downsample_ratio, num_image_token, vision_config, llm_config)
Bases: ArchConfigWithKVCache
Configuration for InternVL models.
-
Parameters:
-
- devices (list[DeviceRef])
- downsample_ratio (float)
- num_image_token (int)
- vision_config (VisionConfig)
- llm_config (Llama3Config | Qwen3Config)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculate maximum sequence length for InternVL.
-
Parameters:
-
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
-
Return type:
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
-
Parameters:
-
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
devices
Devices that the InternVL model is parallelized over.
downsample_ratio
downsample_ratio: float
Downsample ratio for vision features.
finalize()
finalize(huggingface_config, llm_state_dict, vision_state_dict, dtype, return_logits, norm_method='rms_norm')
Finalize the InternVLConfig instance with state_dict dependent fields.
-
Parameters:
-
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
- vision_state_dict (dict[str, WeightData]) – Vision encoder weights dictionary.
- dtype (DType) – Data type for model parameters.
- return_logits (ReturnLogits) – Return logits configuration.
- norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method.
-
Return type:
-
None
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
-
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
-
Return type:
get_num_layers()
static get_num_layers(huggingface_config)
-
Parameters:
-
huggingface_config (AutoConfig)
-
Return type:
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes an InternVLConfig instance from pipeline configuration.
-
Parameters:
-
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
-
Returns:
-
An InternVLConfig instance with fields initialized from config.
-
Return type:
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config)
Initializes an InternVLConfig from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
-
Parameters:
-
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
-
Returns:
-
An InternVLConfig instance ready for finalization.
-
Return type:
llm_config
llm_config: Llama3Config | Qwen3Config
Language model configuration (Llama3 or Qwen3).
num_image_token
num_image_token: int
Number of image tokens per patch.
vision_config
vision_config: VisionConfig
Vision encoder configuration.
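As a hedged illustration of how `downsample_ratio` and `num_image_token` relate: in upstream InternVL, the per-patch token count is derived from the vision encoder's geometry as `(image_size / patch_size)^2 * downsample_ratio^2`. Whether this MAX config computes it the same way is an assumption; the defaults below are the common InternVL settings (448px images, 14px vision patches, 0.5 downsampling).

```python
def num_image_tokens(image_size: int = 448, patch_size: int = 14,
                     downsample_ratio: float = 0.5) -> int:
    # Pixel-unshuffle downsampling shrinks each spatial dimension by
    # downsample_ratio, so the token count scales by its square.
    patches_per_side = image_size // patch_size  # 448 // 14 = 32
    return int(patches_per_side**2 * downsample_ratio**2)


print(num_image_tokens())  # 256
```

The result matches the `num_image_tokens=256` figure cited in `estimate_activation_memory()`.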
InternVLModel
class max.pipelines.architectures.internvl.InternVLModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[TextAndVisionContext]
An InternVL pipeline model for multimodal text generation.
-
Parameters:
-
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the InternVL model.
-
Parameters:
-
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
-
Return type:
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for InternVL model execution.
This accounts for the temporary memory buffers used during model execution, particularly for the vision encoder and language model activations.
Based on empirical analysis of MGP buffer plans (GEX-2365):
- Vision encoder uses ~128MiB per image.
- Language model uses ~100KB per token for intermediate activations.
These values were obtained by printing the high-water mark from the mgp.buffer.plan op and verified against GPU free memory at runtime.
The vision encoder memory scales with the number of images that can be processed concurrently, which is limited by max_batch_input_tokens / num_image_tokens where num_image_tokens=256 for InternVL.
TODO(GEX-2365): Replace this with a more general solution that analyzes the compiled graph’s memory requirements directly.
-
Parameters:
-
- pipeline_config (PipelineConfig) – Pipeline configuration
- huggingface_config (AutoConfig) – HuggingFace model configuration
-
Returns:
-
Estimated activation memory in bytes
-
Return type:
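The heuristic above can be checked with back-of-envelope arithmetic, under stated assumptions: ~128 MiB of vision-encoder activations per image, ~100 KiB of language-model activations per token, 256 image tokens per image, and a hypothetical `max_batch_input_tokens` of 8192. The real method reads these limits from `PipelineConfig`; the numbers here are only for illustration.

```python
MIB = 1024 * 1024

max_batch_input_tokens = 8192      # assumed pipeline limit (hypothetical)
num_image_tokens = 256             # per image, per the docstring above
vision_mib_per_image = 128         # empirical figure from the docstring
lm_bytes_per_token = 100 * 1024    # ~100 KB per token, from the docstring

# Vision memory scales with how many images can be in flight at once.
max_concurrent_images = max_batch_input_tokens // num_image_tokens  # 32
vision_bytes = max_concurrent_images * vision_mib_per_image * MIB
lm_bytes = max_batch_input_tokens * lm_bytes_per_token

print(vision_bytes // MIB, lm_bytes // MIB)  # 4096 MiB vision, 800 MiB LM
```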
execute()
execute(model_inputs)
Executes the InternVL model with the prepared inputs.
-
Parameters:
-
model_inputs (ModelInputs)
-
Return type:
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Gets the parameters required to configure the KV cache for InternVL.
-
Parameters:
-
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Loads the compiled InternVL models into the MAX Engine session.
-
Returns:
-
A tuple of (vision_model, language_model).
-
Parameters:
-
session (InferenceSession)
-
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs for the first execution pass of the InternVL model.
-
Parameters:
-
- replica_batches (Sequence[Sequence[TextAndVisionContext]])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- return_n_logits (int)
-
Return type:
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the inputs for subsequent execution steps in a multi-step generation.
-
Parameters:
-
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
-
Return type:
vision_model
vision_model: Model
The compiled vision model for processing images.
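To show the call order a pipeline runner follows across the prepare/execute methods above, here is a stub that mirrors the method names but whose internals are invented for demonstration (it is not the real MAX API): the first pass consumes the full prompt, and each subsequent pass feeds back only the newly sampled token.

```python
class StubPipelineModel:
    """Stand-in for InternVLModel: illustrates the prepare/execute cycle."""

    def prepare_initial_token_inputs(self, replica_batches,
                                     kv_cache_inputs=None, return_n_logits=1):
        # First pass: inputs carry the full prompt token sequence.
        prompt = replica_batches[0][0]
        return {"tokens": list(prompt), "step": 0}

    def execute(self, model_inputs):
        # Pretend the model predicts last_token + 1.
        return model_inputs["tokens"][-1] + 1

    def prepare_next_token_inputs(self, next_tokens, prev_model_inputs):
        # Subsequent passes: only the newly sampled token is fed back.
        return {"tokens": [next_tokens], "step": prev_model_inputs["step"] + 1}


model = StubPipelineModel()
inputs = model.prepare_initial_token_inputs([[[1, 2, 3]]])
generated = []
for _ in range(3):  # three decode steps
    token = model.execute(inputs)
    generated.append(token)
    inputs = model.prepare_next_token_inputs(token, inputs)
print(generated)  # [4, 5, 6]
```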