Python module

max.pipelines.architectures.idefics3_modulev3

Idefics3 vision-language architecture for multimodal text generation.

Idefics3Config

class max.pipelines.architectures.idefics3_modulev3.Idefics3Config(*, devices, scale_factor, image_token_id, vision_config, text_config)

Bases: ArchVLConfigWithTextSubconfig, ArchConfigWithKVCache

Configuration for Idefics3 models (ModuleV3).

devices

devices: list[DeviceRef]

Devices that the Idefics3 model is parallelized over.

finalize()

finalize(huggingface_config, llm_state_dict, return_logits, return_hidden_states=ReturnHiddenStates.NONE, norm_method='rms_norm')

Finalize the Idefics3Config with state_dict-dependent fields.

Return type:

None

get_kv_params()

get_kv_params()

Returns the KV cache parameters from the embedded text config.

Return type:

KVCacheParams

get_num_layers()

static get_num_layers(huggingface_config)

Returns the number of layers in the language model.

Parameters:

huggingface_config (AutoConfig)

Return type:

int

image_seq_len

property image_seq_len: int

Calculate the number of image tokens after connector processing.
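
As a rough illustration, this count can be sketched from the vision encoder's image size and patch size together with scale_factor. The formula and the helper name compute_image_seq_len are assumptions based on how a pixel shuffle usually shrinks a patch grid; they are not taken from this property's implementation, and the values 364, 14, and 2 are illustrative only.

```python
def compute_image_seq_len(image_size: int, patch_size: int, scale_factor: int) -> int:
    """Hypothetical helper: image tokens left after a pixel-shuffle connector.

    Assumes a square image split into square patches, and a pixel shuffle
    that merges each scale_factor x scale_factor block of patches into one.
    """
    patches_per_side = image_size // patch_size
    total_patches = patches_per_side ** 2
    return total_patches // (scale_factor ** 2)


# Illustrative values only: 364 // 14 = 26 patches per side, 26 * 26 = 676 patches.
print(compute_image_seq_len(image_size=364, patch_size=14, scale_factor=2))  # 169
```

Under these assumptions, a scale factor of 2 merges every 2 x 2 block of patches, so the language model sees a quarter as many image positions as the vision encoder produced.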

image_token_id

image_token_id: int

Token ID used to represent image tokens in the text sequence.
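
To make the role of this ID concrete, the sketch below scans a token sequence for image placeholder positions, which is the kind of information a field such as image_token_indices would carry. The helper name find_image_token_indices and the ID value 99 are hypothetical; the real ID comes from the model's configuration, and the library may compute these positions differently.

```python
def find_image_token_indices(tokens: list[int], image_token_id: int) -> list[int]:
    """Hypothetical helper: positions in `tokens` that hold the image placeholder."""
    return [i for i, token in enumerate(tokens) if token == image_token_id]


# 99 stands in for the image token ID; the other values are ordinary text tokens.
tokens = [1, 99, 99, 5, 99]
print(find_image_token_indices(tokens, image_token_id=99))  # [1, 2, 4]
```

Positions found this way mark where image embeddings would replace the placeholder text embeddings before the language model runs.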

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initializes an Idefics3Config instance from pipeline configuration.

Return type:

Self

scale_factor

scale_factor: int

Scale factor for pixel shuffle operation in the connector.

text_config

text_config: Llama3Config

Text model configuration (Llama3-based).

vision_config

vision_config: Idefics3VisionConfig

Vision encoder configuration (SigLIP-based).

Idefics3Inputs

class max.pipelines.architectures.idefics3_modulev3.Idefics3Inputs(tokens, input_row_offsets, return_n_logits, pixel_values=None, image_token_indices=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None)

Bases: ModelInputs

Inputs for the Idefics3 model.

has_vision_inputs

property has_vision_inputs: bool

image_token_indices

image_token_indices: Buffer | None = None

input_row_offsets

input_row_offsets: Buffer

pixel_values

pixel_values: Buffer | None = None

return_n_logits

return_n_logits: Buffer

tokens

tokens: Buffer

Idefics3Model

class max.pipelines.architectures.idefics3_modulev3.Idefics3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: PipelineModelWithKVCache[TextAndVisionContext]

An Idefics3 pipeline model using the ModuleV3 API.

calculate_max_seq_len()

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

Returns the maximum sequence length: max_length when it is set, otherwise text_config.max_position_embeddings, the upper bound given by the model config.

Return type:

int
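
The fallback described for calculate_max_seq_len can be sketched as a small standalone function. The name resolve_max_seq_len and its plain-integer arguments are hypothetical stand-ins for the pipeline and Hugging Face config objects; any bounds checking the real method performs is not modeled here.

```python
from typing import Optional


def resolve_max_seq_len(max_length: Optional[int], max_position_embeddings: int) -> int:
    """Hypothetical sketch: prefer an explicit max_length, else the config limit."""
    if max_length is not None:
        return max_length
    return max_position_embeddings


print(resolve_max_seq_len(None, 8192))  # 8192, falls back to the config limit
print(resolve_max_seq_len(4096, 8192))  # 4096, the explicit setting wins
```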

execute()

execute(model_inputs)

Execute the Idefics3 model.

Parameters:

model_inputs (ModelInputs)

Return type:

ModelOutputs

language_model

language_model: Callable[..., Any]

The compiled language model.

load_model()

load_model()

Compile vision and language models using the V3 API.

Returns:

A tuple of (compiled_vision_model, compiled_language_model).

Return type:

tuple[Callable[..., Any], Callable[..., Any]]

model_config_cls

model_config_cls

alias of Idefics3Config

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare the initial inputs for the first execution pass.

Return type:

ModelInputs

vision_model

vision_model: Callable[..., Any]

The compiled vision model.