IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Python class

TextAndVisionContext

class max.pipelines.context.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, vocab_size=None, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, grammar=None, grammar_state=<factory>, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, in_reasoning_phase=False, target_endpoint=None, external_block_metadata=None, dkv_hint_instance_name='', cached_prefix_length=None, _cache_metrics_emitted=False, vision_token_ids, images=<factory>, extra_model_args=<factory>)

Bases: TextContext

A base class for model context, specific to vision model variants.

For example:

- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

-       idx: [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
- token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                              ^-- img0 --^                  ^-- img1 --^
                                                 ^ start_idx=11 (image_idx=1)

Then we would have:

- ImageMetadata(start_idx=5, end_idx=9, ...)  # img0
- ImageMetadata(start_idx=15, end_idx=19, ...)  # img1

These image ranges should be non-overlapping.

The image_idx is determined by the value of start_idx: it is the index of the first image that is not yet encoded. For example, in the diagram above, start_idx=11 implies image_idx=1.

When chunked prefill is not active, current_position must not fall in the middle of an image. This is verified in _validate_state, which is called before and after mutating methods such as _bump_token_indices. During chunked prefill, the restriction is relaxed because the vision encoder cache ensures that each image is encoded once and reused across chunks.
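The relationship between the image ranges and image_idx can be sketched in plain Python. This is a minimal illustration, not the MAX implementation: the `Image` dataclass is a hypothetical stand-in for ImageMetadata that keeps only start_idx and end_idx, the helper names are invented for this sketch, and treating end_idx as exclusive is an assumption inferred from the example above (four placeholder tokens at positions 5 to 8, with end_idx=9).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for ImageMetadata (index range only)."""

    start_idx: int
    end_idx: int  # assumed to be exclusive


def ranges_are_disjoint(images: list[Image]) -> bool:
    """Check that images are in order and their ranges do not overlap."""
    return all(a.end_idx <= b.start_idx for a, b in zip(images, images[1:]))


def next_image_idx(images: list[Image], position: int) -> int:
    """Index of the first image that is not fully behind `position`."""
    for i, img in enumerate(images):
        if position < img.end_idx:
            return i
    return len(images)  # every image is already behind `position`


# The two images from the token array above.
images = [Image(5, 9), Image(15, 19)]
assert ranges_are_disjoint(images)

# Position 11 lies past img0 and before img1.
print(next_image_idx(images, 11))
```

Under these assumptions, a position inside or before an image maps to that image, and a position past the last image maps to len(images).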

Parameters:

compute_image_aligned_idx()

compute_image_aligned_idx(idx)

Aligns an index value downward if it lies in the middle of an image.

Parameters:

idx (int)

Return type:

int
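As a rough sketch of what downward alignment could look like, the function below snaps an index back to the start of an image when it falls strictly inside that image's range. The function name `align_down`, the use of (start_idx, end_idx) tuples, and the choice to treat the range boundaries as already aligned are all assumptions made for illustration; the actual method may differ.

```python
def align_down(idx: int, image_ranges: list[tuple[int, int]]) -> int:
    """Return the image start if idx is strictly inside an image, else idx."""
    for start, end in image_ranges:
        if start < idx < end:
            return start
    return idx


# (start_idx, end_idx) pairs matching the class-level example.
ranges = [(5, 9), (15, 19)]

print(align_down(7, ranges))   # inside img0, so it moves back to img0's start
print(align_down(11, ranges))  # between images, so it is left unchanged
```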

extra_model_args

extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]

Extra, model-specific arguments for the vision model.

image_idx

property image_idx: int

Index of the next unencoded image in the prompt.

image_token_indices

property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]

Positions of image-placeholder tokens in the full token sequence.

Derived from images metadata. Subclasses that precompute indices at tokenization time (e.g. KimiK2.5, Qwen2.5VL) may override this with a stored field for efficiency.
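One way to picture the derivation from image metadata is to expand each image's index range into individual positions. The sketch below is illustrative only: it returns a plain Python list rather than an int32 ndarray, the name `placeholder_positions` is hypothetical, and it assumes that every position in [start_idx, end_idx) holds a placeholder token.

```python
def placeholder_positions(image_ranges: list[tuple[int, int]]) -> list[int]:
    """Expand (start_idx, end_idx) pairs into a flat list of positions."""
    positions: list[int] = []
    for start, end in image_ranges:
        positions.extend(range(start, end))  # end is assumed exclusive
    return positions


# Ranges of the two images in the class-level example.
positions = placeholder_positions([(5, 9), (15, 19)])
print(positions)
```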

images

images: list[ImageMetadata]

Metadata about each image in the prompt.

needs_vision_encoding

property needs_vision_encoding: bool

Returns whether vision encoding is needed for this context.

next_images

property next_images: list[ImageMetadata]

Returns the images that are not yet encoded.
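Given that image_idx is described as the index of the next unencoded image, the set of remaining images can be thought of as a slice of the image list. The snippet below is a sketch under that assumption, with strings standing in for ImageMetadata objects and a hypothetical helper name; it is not taken from the library source.

```python
def remaining_images(images: list[str], image_idx: int) -> list[str]:
    """Images from image_idx onward, i.e. those not yet encoded."""
    return images[image_idx:]


all_images = ["img0", "img1"]  # placeholders for ImageMetadata entries

print(remaining_images(all_images, 1))  # img0 is done, img1 is pending
print(remaining_images(all_images, 2))  # nothing left to encode
```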

update()

update(new_token, log_probabilities=None)

Updates the context with a new token and validates vision state.

Parameters:

Return type:

None

vision_token_ids

vision_token_ids: list[int]

The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.
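To show why a list is convenient, the sketch below finds vision-related positions by testing membership against every ID in the list, so a model with a single placeholder ID and a model with an additional break token share the same code path. The token values are illustrative: 98 follows the class-level example, 100 is a made-up ID standing in for a break token, and `vision_positions` is a hypothetical helper.

```python
def vision_positions(token_ids: list[int], vision_token_ids: list[int]) -> list[int]:
    """Positions whose token is any of the given vision token IDs."""
    vision = set(vision_token_ids)
    return [i for i, tok in enumerate(token_ids) if tok in vision]


# 98 = image placeholder, 100 = hypothetical break token between image rows.
tokens = [51, 98, 98, 100, 98, 98, 52]

print(vision_positions(tokens, [98]))       # placeholder tokens only
print(vision_positions(tokens, [98, 100]))  # placeholders plus break token
```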