Python class

`VLMTextGenerationContext`

class max.interfaces.VLMTextGenerationContext(*args, **kwargs)

Bases: TextGenerationContext, Protocol

Protocol defining the interface for vision-language model (VLM) input contexts.

`compute_image_aligned_idx()`

compute_image_aligned_idx(idx)

Aligns an index downward to avoid splitting an image token span.

If idx falls within the token range occupied by an image, this method returns the start_idx of that image so that the split point does not cut through image tokens. If idx does not land inside any image span, it is returned unchanged.

Parameters: idx (int) – The candidate index into the token sequence.
Returns: The adjusted index, guaranteed not to split an image token span.
Return type: int
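The alignment rule described above can be sketched in a few lines. This is a minimal illustration, not the MAX implementation: `ImageSpan` and `align_idx_to_image_start` are hypothetical names, and the sketch assumes each image occupies a half-open token range `[start_idx, end_idx)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageSpan:
    """Hypothetical stand-in for image metadata; not the MAX ImageMetadata class."""

    start_idx: int  # first token position occupied by the image
    end_idx: int  # one past the last image token (half-open range, assumed)


def align_idx_to_image_start(idx: int, spans: list[ImageSpan]) -> int:
    """Move idx down to the start of any image span it would otherwise split."""
    for span in spans:
        # An index strictly inside the span would cut through image tokens.
        if span.start_idx < idx < span.end_idx:
            return span.start_idx
    # The index is outside every span (or on a boundary), so it is kept as-is.
    return idx


spans = [ImageSpan(4, 10), ImageSpan(15, 20)]
aligned_inside = align_idx_to_image_start(7, spans)  # falls inside the first image
aligned_outside = align_idx_to_image_start(12, spans)  # between the two images
aligned_boundary = align_idx_to_image_start(10, spans)  # just past the first image
```

Under these assumptions, a split point can only land on an image boundary or in plain text, which matters when a prompt is processed in chunks.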

`image_idx`

property image_idx: int

Index of the next unencoded image in the prompt.

`image_token_indices`

property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]

Positions of image-placeholder tokens within this context’s token buffer.

Offsets are relative to the start of the full token sequence, not the active window. compute_multimodal_merge_indices uses them to build batch-level scatter indices that account for processed_length.
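To make the distinction between full-sequence offsets and the active window concrete, here is one plausible way such offsets could be turned into batch-level positions. It is a sketch under stated assumptions, not the behavior of compute_multimodal_merge_indices: the function name `to_batch_scatter_indices`, the `active_length` value, and the tuple layout are hypothetical, and plain Python lists stand in for the int32 arrays.

```python
def to_batch_scatter_indices(
    contexts: list[tuple[list[int], int, int]],
) -> list[int]:
    """Map full-sequence image-token offsets to positions in a flattened batch.

    Each entry is (image_token_indices, processed_length, active_length).
    The batch is assumed to be the concatenation of each context's active
    window, i.e. tokens [processed_length, processed_length + active_length).
    """
    scatter: list[int] = []
    batch_offset = 0  # where the current context's window starts in the batch
    for indices, processed_length, active_length in contexts:
        window_end = processed_length + active_length
        for pos in indices:
            # Keep only image tokens that fall inside the active window.
            if processed_length <= pos < window_end:
                # Shift from full-sequence offset to window-relative offset,
                # then to a position in the flattened batch.
                scatter.append(batch_offset + pos - processed_length)
        batch_offset += active_length
    return scatter


# Context A: image tokens at 2..5, first 4 tokens already processed, window of 4.
# Context B: image tokens at 1..2, nothing processed yet, window of 3.
batch_indices = to_batch_scatter_indices(
    [([2, 3, 4, 5], 4, 4), ([1, 2], 0, 3)]
)
```

In this example only the image tokens at offsets 4 and 5 of context A remain in its window, so they map to batch positions 0 and 1, while context B's tokens land after A's four-token window.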

`images`

property images: list[ImageMetadata]

The images in the context.

`needs_vision_encoding`

property needs_vision_encoding: bool

Whether vision encoding is needed for this context.

`next_images`

property next_images: list[ImageMetadata]

The images that are not yet encoded.
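The properties image_idx, images, next_images, and needs_vision_encoding describe related state. The toy class below shows one way they could fit together; it is an assumption for illustration only. `ToyImage`, `ToyVLMContext`, and `mark_encoded` are hypothetical names, and the protocol itself does not prescribe how an implementation derives these values.

```python
from dataclasses import dataclass, field


@dataclass
class ToyImage:
    """Hypothetical stand-in for ImageMetadata."""

    start_idx: int
    end_idx: int


@dataclass
class ToyVLMContext:
    """Toy context; assumes images are encoded in prompt order."""

    images: list[ToyImage] = field(default_factory=list)
    image_idx: int = 0  # index of the next unencoded image

    @property
    def next_images(self) -> list[ToyImage]:
        # Assumption: everything from image_idx onward is still unencoded.
        return self.images[self.image_idx:]

    @property
    def needs_vision_encoding(self) -> bool:
        # Assumption: encoding is needed while any image remains unencoded.
        return len(self.next_images) > 0

    def mark_encoded(self, count: int) -> None:
        """Hypothetical helper that advances past `count` encoded images."""
        self.image_idx = min(self.image_idx + count, len(self.images))


ctx = ToyVLMContext(images=[ToyImage(4, 10), ToyImage(15, 20)])
pending_before = len(ctx.next_images)
ctx.mark_encoded(1)
pending_after = len(ctx.next_images)
ctx.mark_encoded(1)
all_encoded = not ctx.needs_vision_encoding
```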
