Python class

VLMTextGenerationContext

class max.interfaces.VLMTextGenerationContext(*args, **kwargs)

Bases: TextGenerationContext, Protocol

Protocol defining the interface for vision-language model (VLM) input contexts.

compute_image_aligned_idx()

compute_image_aligned_idx(idx)

Aligns an index downward to avoid splitting an image token span.

If idx falls within the token range occupied by an image, this method returns the start_idx of that image so that the split point does not cut through image tokens. If idx does not land inside any image span, it is returned unchanged.

Parameters:

idx (int) – The candidate index into the token sequence.

Returns:

The adjusted index, guaranteed not to split an image token span.

Return type:

int
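
The alignment rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: ImageSpan is a hypothetical stand-in for the image metadata, and the sketch assumes each image records a start_idx and an exclusive end_idx.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageSpan:
    """Hypothetical stand-in for image metadata (not the MAX type)."""

    start_idx: int  # first token position of the image
    end_idx: int  # one past the last token position (assumed exclusive)


def align_idx_down(idx: int, spans: list[ImageSpan]) -> int:
    """Return idx, moved down to an image's start if it would split that image."""
    for span in spans:
        # Strictly inside the span: splitting here would cut the image in two.
        if span.start_idx < idx < span.end_idx:
            return span.start_idx
    # idx sits on a span boundary or outside every span: safe as-is.
    return idx


spans = [ImageSpan(start_idx=4, end_idx=10)]
print(align_idx_down(7, spans))   # inside the image span
print(align_idx_down(4, spans))   # exactly at the start of the image
print(align_idx_down(12, spans))  # outside any image span
```

Aligning downward rather than upward keeps the chunk that ends at the split point free of partial images, so the whole image falls into the following chunk.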

image_idx

property image_idx: int

Index of the next unencoded image in the prompt.

image_token_indices

property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]

Positions of image-placeholder tokens within this context’s token buffer.

Offsets are relative to the start of the full token sequence (not the active window). Used by compute_multimodal_merge_indices to build batch-level scatter indices that account for processed_length.
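
To illustrate the difference between sequence-relative and batch-relative offsets, here is a minimal sketch. It is not the library's compute_multimodal_merge_indices: the helper name to_batch_scatter_indices, the tuple layout, and the use of plain lists in place of int32 arrays are all assumptions made for illustration. It also assumes each context contributes only its unprocessed tokens (those at or after processed_length) to a flattened batch.

```python
def to_batch_scatter_indices(
    contexts: list[tuple[list[int], int, int]],
) -> list[int]:
    """Map per-context image token positions to flattened batch positions.

    Each entry is (image_token_indices, processed_length, total_length),
    where the indices are relative to the start of the full token sequence.
    """
    scatter: list[int] = []
    batch_offset = 0  # where the current context's active window starts in the batch
    for indices, processed_length, total_length in contexts:
        for pos in indices:
            # Skip placeholders that were already processed in an earlier step.
            if pos >= processed_length:
                # Shift from sequence-relative to window-relative, then to batch-relative.
                scatter.append(pos - processed_length + batch_offset)
        batch_offset += total_length - processed_length
    return scatter


# Context A: 8 tokens, none processed, image placeholders at 2..4.
# Context B: 10 tokens, 6 already processed, placeholders at 1..2 and 7..8.
batch = [([2, 3, 4], 0, 8), ([1, 2, 7, 8], 6, 10)]
print(to_batch_scatter_indices(batch))
```

Storing offsets against the full sequence means the property does not change as processed_length advances; only the consumer has to apply the shift.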

images

property images: list[ImageMetadata]

The images in the context.

needs_vision_encoding

property needs_vision_encoding: bool

Whether vision encoding is needed for this context.

next_images

property next_images: list[ImageMetadata]

The images that are not yet encoded.
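
As a rough illustration of how images, image_idx, next_images, and needs_vision_encoding could relate, the toy class below tracks encoding progress with a single counter. This is an assumed model, not the MAX implementation: ToyVLMContext, mark_encoded, and the use of strings in place of ImageMetadata are hypothetical, and a real context may also take the active token window into account.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVLMContext:
    """Hypothetical toy model of the image-tracking properties."""

    images: list[str] = field(default_factory=list)  # strings stand in for ImageMetadata
    image_idx: int = 0  # index of the next unencoded image

    @property
    def next_images(self) -> list[str]:
        # Assumption: everything from image_idx onward is still unencoded.
        return self.images[self.image_idx:]

    @property
    def needs_vision_encoding(self) -> bool:
        # Assumption: encoding is needed while any image remains unencoded.
        return bool(self.next_images)

    def mark_encoded(self, count: int) -> None:
        # Hypothetical helper: advance past images the vision encoder has handled.
        self.image_idx = min(self.image_idx + count, len(self.images))


ctx = ToyVLMContext(images=["img0", "img1"])
print(ctx.needs_vision_encoding, ctx.next_images)
ctx.mark_encoded(2)
print(ctx.needs_vision_encoding, ctx.next_images)
```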