Python module

max.pipelines.architectures.idefics3

Idefics3 vision-language architecture for multimodal text generation.

Idefics3Config

class max.pipelines.architectures.idefics3.Idefics3Config(*, devices, scale_factor, image_token_id, vision_config, text_config)

Bases: ArchVLConfigWithTextSubconfig, ArchConfigWithKVCache

Configuration for Idefics3 models.

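Parameters:

  • devices (list[DeviceRef])
  • scale_factor (int)
  • image_token_id (int)
  • vision_config (Idefics3VisionConfig)
  • text_config (Llama3Config)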

devices

devices: list[DeviceRef]

Devices that the Idefics3 model is parallelized over.

finalize()

finalize(huggingface_config, llm_state_dict, return_logits, norm_method='rms_norm')

Finalizes the Idefics3Config instance by populating the fields that depend on the state dict.

Parameters:

  • huggingface_config (AutoConfig) – Hugging Face model configuration.
  • llm_state_dict (dict[str, WeightData]) – Language model weights dictionary.
  • return_logits (ReturnLogits) – Return logits configuration.
  • norm_method (Literal['rms_norm', 'layer_norm']) – Normalization method. Defaults to 'rms_norm'.

Return type:

None

get_kv_params()

get_kv_params()

Returns the KV cache parameters from the embedded text config.

Return type:

KVCacheParams

get_num_layers()

static get_num_layers(huggingface_config)

Returns the number of layers in the language model.

Parameters:

huggingface_config (AutoConfig)

Return type:

int

image_seq_len

property image_seq_len: int

Calculate the number of image tokens after connector processing.
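As a rough sketch of what this property likely computes, assume it follows the formula documented for the reference Hugging Face Idefics3 processor: the vision encoder emits (image_size // patch_size)² patch embeddings, and pixel shuffle in the connector divides that count by scale_factor². The helper name below is hypothetical, and 364, 14, and 2 are example settings rather than values read from this module.

```python
def estimate_image_seq_len(image_size: int, patch_size: int, scale_factor: int) -> int:
    """Estimate image tokens per image after the connector (assumed formula)."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Pixel shuffle merges scale_factor x scale_factor patches into one token.
    return num_patches // (scale_factor ** 2)


print(estimate_image_seq_len(image_size=364, patch_size=14, scale_factor=2))  # 169
```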

image_token_id

image_token_id: int

Token ID used to represent image tokens in the text sequence.

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initializes an Idefics3Config instance from pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig)
  • model_config – Optional. Defaults to None.

Returns:

An Idefics3Config instance with fields initialized from config.

Return type:

Self

scale_factor

scale_factor: int

Scale factor for pixel shuffle operation in the connector.
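To illustrate what a pixel shuffle with a given scale factor does to the vision encoder output, here is a minimal NumPy sketch. It assumes a square grid of patch embeddings and folds each scale_factor × scale_factor block of neighboring patches into a single token with a wider channel dimension. The function is a hypothetical stand-in, not the connector code in this module, which may order the merged channels differently.

```python
import numpy as np


def pixel_shuffle(x: np.ndarray, scale_factor: int = 2) -> np.ndarray:
    """Fold each scale_factor x scale_factor block of patches into one token.

    x has shape (batch, seq_len, embed_dim), where seq_len is a square number.
    """
    batch, seq_len, embed_dim = x.shape
    side = int(round(seq_len ** 0.5))
    if side * side != seq_len or side % scale_factor != 0:
        raise ValueError("expected a square patch grid divisible by scale_factor")
    s = scale_factor
    # Split rows and columns into (block index, offset inside the block).
    x = x.reshape(batch, side // s, s, side // s, s, embed_dim)
    # Move both in-block offsets next to the channel axis.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # Merge the offsets into the channels and flatten the block grid.
    return x.reshape(batch, (side // s) ** 2, embed_dim * s * s)


patches = np.zeros((1, 676, 8), dtype=np.float32)  # 26 x 26 patch grid
print(pixel_shuffle(patches, scale_factor=2).shape)  # (1, 169, 32)
```

The sequence length shrinks by scale_factor² while the embedding width grows by the same factor, so no values are discarded.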

text_config

text_config: Llama3Config

Text model configuration (Llama3-based).

vision_config

vision_config: Idefics3VisionConfig

Vision encoder configuration (SigLIP-based).

Idefics3Inputs

class max.pipelines.architectures.idefics3.Idefics3Inputs(tokens, input_row_offsets, return_n_logits, pixel_values=None, image_token_indices=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

A class representing inputs for the Idefics3 model.

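Parameters:

  • tokens (Buffer)
  • input_row_offsets (Buffer)
  • return_n_logits (Buffer)
  • pixel_values (Buffer | None)
  • image_token_indices (Buffer | None)
  • kv_cache_inputs – Keyword-only. Defaults to None.
  • lora_ids – Keyword-only. Defaults to None.
  • lora_ranks – Keyword-only. Defaults to None.
  • hidden_states – Keyword-only. Defaults to None.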

has_vision_inputs

property has_vision_inputs: bool

Check if this input contains vision data.

image_token_indices

image_token_indices: Buffer | None = None

Pre-computed indices of image tokens in the input sequence.
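The sketch below shows one plausible way such indices could be derived and used, under two assumptions that this page does not confirm: the indices are the positions where the token sequence equals image_token_id, and the vision embeddings are written into the text embeddings at those positions. Both helper names and the token ID 9 are hypothetical.

```python
import numpy as np

IMAGE_TOKEN_ID = 9  # hypothetical ID, chosen only for this example


def find_image_token_indices(tokens: np.ndarray, image_token_id: int) -> np.ndarray:
    """Return the positions in a flat token array that hold the image token."""
    return np.flatnonzero(tokens == image_token_id)


def merge_image_embeddings(
    text_embeds: np.ndarray, image_embeds: np.ndarray, indices: np.ndarray
) -> np.ndarray:
    """Copy text_embeds and overwrite the rows at indices with image_embeds."""
    if len(indices) != len(image_embeds):
        raise ValueError("need exactly one image embedding per image token")
    merged = text_embeds.copy()
    merged[indices] = image_embeds
    return merged


tokens = np.array([1, 9, 9, 5, 7])
indices = find_image_token_indices(tokens, IMAGE_TOKEN_ID)
text_embeds = np.zeros((5, 4), dtype=np.float32)
image_embeds = np.ones((2, 4), dtype=np.float32)
merged = merge_image_embeddings(text_embeds, image_embeds, indices)
print(indices)  # [1 2]
```

Computing the indices once on the host, before execution, avoids searching the token sequence inside the compiled graph.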

input_row_offsets

input_row_offsets: Buffer

Tensor containing the offsets for each row in the ragged input sequence.

pixel_values

pixel_values: Buffer | None = None

Pixel values for vision inputs.

return_n_logits

return_n_logits: Buffer

Number of logits to return per sequence (used, for example, by speculative decoding).

tokens

tokens: Buffer

Tensor containing the input token IDs.
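The ragged layout shared by tokens and input_row_offsets can be sketched with plain NumPy arrays. This example assumes the common convention in which all sequences in a batch are concatenated into one flat array and the offsets array has batch_size + 1 entries, starting at 0, so that row i spans offsets[i] to offsets[i + 1]. The pack_ragged helper is hypothetical; real inputs are Buffer objects rather than NumPy arrays.

```python
import numpy as np


def pack_ragged(sequences: list[list[int]]) -> tuple[np.ndarray, np.ndarray]:
    """Concatenate variable-length sequences and record where each one starts."""
    lengths = [len(seq) for seq in sequences]
    tokens = np.concatenate([np.asarray(seq, dtype=np.int64) for seq in sequences])
    # Offsets are a running sum of lengths with a leading zero.
    offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.uint32)
    return tokens, offsets


tokens, input_row_offsets = pack_ragged([[11, 12, 13], [21, 22]])
print(tokens)             # [11 12 13 21 22]
print(input_row_offsets)  # [0 3 5]

# Recover the second sequence from the flat array.
second = tokens[input_row_offsets[1]:input_row_offsets[2]]
print(second)             # [21 22]
```

Packing sequences this way avoids padding every row to the length of the longest one.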

Idefics3Model

class max.pipelines.architectures.idefics3.Idefics3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: PipelineModelWithKVCache[TextAndVisionContext]

An Idefics3 pipeline model for multimodal text generation.

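Parameters:

  • pipeline_config (PipelineConfig)
  • session (InferenceSession)
  • devices
  • kv_cache_config
  • weights
  • adapter – Optional. Defaults to None.
  • return_logits (ReturnLogits) – Defaults to ReturnLogits.LAST_TOKEN.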

calculate_max_seq_len()

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

Returns the maximum sequence length for the model: max_length when it is set, otherwise text_config.max_position_embeddings from the model configuration.

Parameters:

  • pipeline_config (PipelineConfig)
  • huggingface_config (AutoConfig)

Return type:

int
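The fallback described for calculate_max_seq_len() can be sketched as follows. This is a simplified illustration, not the method itself: resolve_max_seq_len is a hypothetical helper, the configuration is mocked with SimpleNamespace, and the real method may apply extra validation against the configured bound that is not modeled here.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_max_seq_len(max_length: Optional[int], huggingface_config) -> int:
    """Prefer an explicit max_length; otherwise use the text model's limit."""
    if max_length is not None:
        return max_length
    return huggingface_config.text_config.max_position_embeddings


# Mocked configuration; 8192 is an example value only.
hf_config = SimpleNamespace(
    text_config=SimpleNamespace(max_position_embeddings=8192)
)

print(resolve_max_seq_len(None, hf_config))  # 8192
print(resolve_max_seq_len(4096, hf_config))  # 4096
```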

execute()

execute(model_inputs)

Executes the Idefics3 model with the prepared inputs.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

language_model

language_model: Model

The compiled language model for text generation.

load_model()

load_model(session)

Loads the compiled Idefics3 models into the MAX Engine session.

Parameters:

session (InferenceSession)

Returns:

A tuple of (vision_model, language_model).

Return type:

tuple[Model, Model]

model_config_cls

model_config_cls

alias of Idefics3Config

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepares the initial inputs for the first execution pass of the Idefics3 model.

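Parameters:

  • replica_batches
  • kv_cache_inputs – Optional. Defaults to None.
  • return_n_logits – Defaults to 1.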

Return type:

ModelInputs

vision_model

vision_model: Model

The compiled vision model for processing images.

Idefics3VisionConfig

class max.pipelines.architectures.idefics3.Idefics3VisionConfig(dtype, hidden_size, intermediate_size, image_size, patch_size, num_channels, num_attention_heads, head_dim, layer_norm_eps, hidden_act, num_hidden_layers, initializer_range, scale_factor, text_config_hidden_size)

Bases: object

Configuration for Idefics3 Vision Model (SigLIP-based).

Parameters:

  • dtype (DType)
  • hidden_size (int)
  • intermediate_size (int)
  • image_size (int)
  • patch_size (int)
  • num_channels (int)
  • num_attention_heads (int)
  • head_dim (int)
  • layer_norm_eps (float)
  • hidden_act (str)
  • num_hidden_layers (int)
  • initializer_range (float)
  • scale_factor (int)
  • text_config_hidden_size (int)

dtype

dtype: DType

DType of the Idefics3 vision model weights.

head_dim

head_dim: int

Dimension of each attention head.
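In standard multi-head attention the per-head dimension is the hidden size divided evenly across the heads. The sketch below assumes that relation holds here; the page does not state how head_dim is derived, the helper name is hypothetical, and 1152 and 16 are example values only.

```python
def attention_head_dim(hidden_size: int, num_attention_heads: int) -> int:
    """Split the hidden size evenly across attention heads (assumed relation)."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden_size // num_attention_heads


print(attention_head_dim(hidden_size=1152, num_attention_heads=16))  # 72
```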

hidden_act

hidden_act: str

Activation function used in the vision encoder.

hidden_size

hidden_size: int

Hidden size of the vision encoder.

image_size

image_size: int

Input image size.

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config, text_config_hidden_size)

Initializes an Idefics3VisionConfig from a Hugging Face configuration.

Parameters:

  • pipeline_config (PipelineConfig)
  • huggingface_config (AutoConfig)
  • text_config_hidden_size (int)

Return type:

Idefics3VisionConfig

initializer_range

initializer_range: float

Standard deviation for weight initialization.

intermediate_size

intermediate_size: int

Intermediate size in the vision encoder’s feed-forward layers.

layer_norm_eps

layer_norm_eps: float

Epsilon for layer normalization.

num_attention_heads

num_attention_heads: int

Number of attention heads in the vision encoder.

num_channels

num_channels: int

Number of input channels (typically 3 for RGB).

num_hidden_layers

num_hidden_layers: int

Number of hidden layers in the vision encoder.

patch_size

patch_size: int

Vision transformer patch size.

scale_factor

scale_factor: int

Scale factor for pixel shuffle operation in the connector.

text_config_hidden_size

text_config_hidden_size: int

Hidden size from the text config for modality projection.