Python class

TextAndVisionContext

class max.pipelines.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None, vision_token_ids, images=<factory>, extra_model_args=<factory>)

Bases: TextContext

A base class for model context, specifically for vision model variants.

For example, suppose the tokenizer defines these special token IDs:

- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

          idx: [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
    token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                                ^-- img0 --^                  ^-- img1 --^
                                                   ^ start_idx=11 (image_idx=1)

Then we would have:

- ImageMetadata(start_idx=5, end_idx=9, ...)  # img0
- ImageMetadata(start_idx=15, end_idx=19, ...)  # img1

These image ranges should be non-overlapping.

The image_idx is derived from start_idx: it is the index of the first image that has not yet been encoded. For example, in the diagram above, start_idx=11 implies image_idx=1.
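
A minimal sketch of that rule, using a simplified stand-in for ImageMetadata (the real class carries more fields) and the two images from the diagram; this illustrates the docstring's description, not the library's implementation:

```python
from dataclasses import dataclass


@dataclass
class ImageMetadata:
    """Simplified stand-in: only the fields this sketch needs."""
    start_idx: int
    end_idx: int


# The two images from the token-array diagram above.
images = [
    ImageMetadata(start_idx=5, end_idx=9),    # img0
    ImageMetadata(start_idx=15, end_idx=19),  # img1
]


def image_idx_for(start_idx: int, images: list[ImageMetadata]) -> int:
    """Index of the first image whose range start_idx has not yet passed."""
    for i, img in enumerate(images):
        if start_idx < img.end_idx:
            return i
    return len(images)  # every image already encoded


assert image_idx_for(11, images) == 1  # matches the diagram annotation
```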

When chunked prefill is not active, current_position is prevented from landing in the middle of an image. This is verified in _validate_state, which is called before and after mutating methods like _bump_token_indices. During chunked prefill the restriction is relaxed, because the vision encoder cache ensures images are encoded once and reused across chunks.
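
A minimal sketch of what that check could look like, continuing from the ImageMetadata stand-in above (the real _validate_state is a method on the context and checks more than this):

```python
def validate_position(
    pos: int, images: list[ImageMetadata], chunked_prefill: bool
) -> None:
    """Reject a position that splits an image, unless chunked prefill applies."""
    if chunked_prefill:
        return  # vision encoder cache makes mid-image positions safe
    for img in images:
        if img.start_idx < pos < img.end_idx:
            raise ValueError(
                f"position {pos} falls inside image [{img.start_idx}, {img.end_idx})"
            )
```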

compute_image_aligned_idx()

compute_image_aligned_idx(idx)

Aligns an index value downward if it lies in the middle of an image; otherwise the index is returned unchanged.

Parameters:

idx (int)

Return type:

int
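
Judging from the docstring, this is equivalent to snapping an in-image index back to that image's start. A sketch as a free function (the real API is a method taking only idx), reusing the images list from the class overview:

```python
def image_aligned_idx(idx: int, images: list[ImageMetadata]) -> int:
    """Snap an index back to the start of the image it falls inside."""
    for img in images:
        if img.start_idx < idx < img.end_idx:
            return img.start_idx
    return idx


assert image_aligned_idx(7, images) == 5    # inside img0 -> snapped to its start
assert image_aligned_idx(11, images) == 11  # outside any image -> unchanged
```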

extra_model_args

extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]

Extra model-specific arguments for the vision model.
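
For illustration only, with an entirely hypothetical key (the actual keys and shapes depend on the model):

```python
import numpy as np

# Hypothetical example: keys and shapes are model specific.
extra_model_args = {
    "image_grid": np.array([[1, 24, 24]], dtype=np.int64),
}
```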

image_idx

property image_idx: int

Index of the next unencoded image in the prompt.

image_token_indices

property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]

Positions of image-placeholder tokens in the full token sequence.

Derived from the images metadata. Subclasses that precompute indices at tokenization time (e.g. KimiK2.5, Qwen2.5VL) may override this with a stored field for efficiency.
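
A sketch of that derivation, assuming each image's placeholder tokens occupy [start_idx, end_idx) and reusing the images list from the class overview:

```python
import numpy as np


def image_token_indices(images: list[ImageMetadata]) -> np.ndarray:
    """Concatenate the token positions covered by each image range."""
    if not images:
        return np.empty(0, dtype=np.int32)
    return np.concatenate(
        [np.arange(img.start_idx, img.end_idx, dtype=np.int32) for img in images]
    )


print(image_token_indices(images))  # [ 5  6  7  8 15 16 17 18]
```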

images

images: list[ImageMetadata]

Metadata about each image in the prompt.

needs_vision_encoding

property needs_vision_encoding: bool

Returns whether vision encoding is needed for this context.

next_images

property next_images: list[ImageMetadata]

Returns the images that are not yet encoded.
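
Both this property and needs_vision_encoding above presumably follow directly from image_idx; a plausible sketch, again using the stand-in metadata:

```python
def next_images(images: list[ImageMetadata], image_idx: int) -> list[ImageMetadata]:
    """Images at or after image_idx have not been encoded yet."""
    return images[image_idx:]


def needs_vision_encoding(images: list[ImageMetadata], image_idx: int) -> bool:
    """Encoding is needed while any image remains unencoded."""
    return image_idx < len(images)


assert needs_vision_encoding(images, image_idx=1)  # img1 still pending
assert next_images(images, image_idx=2) == []      # everything encoded
```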

update()

update(new_token, log_probabilities=None)

Updates the context with a new token and validates vision state.

Parameters:

- new_token (int)
- log_probabilities (LogProbabilities | None)

Return type:

None
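
Hypothetical usage, assuming ctx is an existing TextAndVisionContext mid-generation:

```python
# Append one newly sampled token; per the docstring, the vision state
# is re-validated as part of the update.
ctx.update(new_token=63)
```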

vision_token_ids

vision_token_ids: list[int]

The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.
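
In the diagram at the top of this page, vision_token_ids would be [98]; a Pixtral-style model would carry a second entry. A sketch (not the library's implementation) of how such ids can locate placeholder positions in a token array:

```python
import numpy as np

token_ids = np.array(
    [51, 52, 53, 54, 97, 98, 98, 98, 98, 99,
     55, 56, 57, 58, 97, 98, 98, 98, 98, 99,
     59, 60, 61, 62]
)
vision_token_ids = [98]

placeholder_positions = np.flatnonzero(np.isin(token_ids, vision_token_ids))
print(placeholder_positions)  # [ 5  6  7  8 15 16 17 18]
```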