Python class

TextAndVisionContext

class max.pipelines.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None, vision_token_ids, images=<factory>, extra_model_args=<factory>)

Bases: TextContext

A base class for model context, specifically for vision model variants.

For example, suppose the tokenizer defines these special token IDs:

- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

          idx: [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
    token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                                ^-- img0 --^                  ^-- img1 --^
                                                   ^ start_idx=11 (image_idx=1)

Then we would have:

- ImageMetadata(start_idx=5, end_idx=9, ...)  # img0
- ImageMetadata(start_idx=15, end_idx=19, ...)  # img1

These image ranges should be non-overlapping.

The image_idx is derived from start_idx: it is the index of the first image that has not yet been encoded. For example, in the diagram above, start_idx=11 implies image_idx=1.
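
A minimal sketch of that rule, using a simplified stand-in for ImageMetadata (the real class carries more fields) and the two images from the diagram; this illustrates the docstring's description, not the library's implementation:

```python
from dataclasses import dataclass


@dataclass
class ImageMetadata:
    """Simplified stand-in: only the fields this sketch needs."""
    start_idx: int
    end_idx: int


# The two images from the token-array diagram above.
images = [
    ImageMetadata(start_idx=5, end_idx=9),    # img0
    ImageMetadata(start_idx=15, end_idx=19),  # img1
]


def image_idx_for(start_idx: int, images: list[ImageMetadata]) -> int:
    """Index of the first image whose range start_idx has not yet passed."""
    for i, img in enumerate(images):
        if start_idx < img.end_idx:
            return i
    return len(images)  # every image already encoded


assert image_idx_for(11, images) == 1  # matches the diagram annotation
```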

When chunked prefill is not active, current_position is prevented from landing in the middle of an image. This is verified in _validate_state, which is called before and after mutating methods like _bump_token_indices. During chunked prefill the restriction is relaxed, because the vision encoder cache ensures images are encoded once and reused across chunks.
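
A minimal sketch of what that check could look like, continuing from the ImageMetadata stand-in above (the real _validate_state is a method on the context and checks more than this):

```python
def validate_position(
    pos: int, images: list[ImageMetadata], chunked_prefill: bool
) -> None:
    """Reject a position that splits an image, unless chunked prefill applies."""
    if chunked_prefill:
        return  # vision encoder cache makes mid-image positions safe
    for img in images:
        if img.start_idx < pos < img.end_idx:
            raise ValueError(
                f"position {pos} falls inside image [{img.start_idx}, {img.end_idx})"
            )
```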

compute_image_aligned_idx()

compute_image_aligned_idx(idx)

Aligns an index value downward if it lies in the middle of an image; otherwise the index is returned unchanged.

Parameters:

idx (int)

Return type:

int
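
Judging from the docstring, this is equivalent to snapping an in-image index back to that image's start. A sketch as a free function (the real API is a method taking only idx), reusing the images list from the class overview:

```python
def image_aligned_idx(idx: int, images: list[ImageMetadata]) -> int:
    """Snap an index back to the start of the image it falls inside."""
    for img in images:
        if img.start_idx < idx < img.end_idx:
            return img.start_idx
    return idx


assert image_aligned_idx(7, images) == 5    # inside img0 -> snapped to its start
assert image_aligned_idx(11, images) == 11  # outside any image -> unchanged
```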

extra_model_args

extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]

Extra model-specific arguments for the vision model.
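
For illustration only, with an entirely hypothetical key (the actual keys and shapes depend on the model):

```python
import numpy as np

# Hypothetical example: keys and shapes are model specific.
extra_model_args = {
    "image_grid": np.array([[1, 24, 24]], dtype=np.int64),
}
```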

image_idx

property image_idx: int

Index of the next unencoded image in the prompt.

image_token_indices

property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]

Positions of image-placeholder tokens in the full token sequence.

Derived from the images metadata. Subclasses that precompute indices at tokenization time (e.g. KimiK2.5, Qwen2.5VL) may override this with a stored field for efficiency.
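
A sketch of that derivation, assuming each image's placeholder tokens occupy [start_idx, end_idx) and reusing the images list from the class overview:

```python
import numpy as np


def image_token_indices(images: list[ImageMetadata]) -> np.ndarray:
    """Concatenate the token positions covered by each image range."""
    if not images:
        return np.empty(0, dtype=np.int32)
    return np.concatenate(
        [np.arange(img.start_idx, img.end_idx, dtype=np.int32) for img in images]
    )


print(image_token_indices(images))  # [ 5  6  7  8 15 16 17 18]
```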

images

images: list[ImageMetadata]

Metadata about each image in the prompt.

needs_vision_encoding

property needs_vision_encoding: bool

Returns whether vision encoding is needed for this context.

next_images

property next_images: list[ImageMetadata]

Returns the images that are not yet encoded.
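
Both this property and needs_vision_encoding above presumably follow directly from image_idx; a plausible sketch, again using the stand-in metadata:

```python
def next_images(images: list[ImageMetadata], image_idx: int) -> list[ImageMetadata]:
    """Images at or after image_idx have not been encoded yet."""
    return images[image_idx:]


def needs_vision_encoding(images: list[ImageMetadata], image_idx: int) -> bool:
    """Encoding is needed while any image remains unencoded."""
    return image_idx < len(images)


assert needs_vision_encoding(images, image_idx=1)  # img1 still pending
assert next_images(images, image_idx=2) == []      # everything encoded
```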

update()

update(new_token, log_probabilities=None)

Updates the context with a new token and validates vision state.

Parameters:

- new_token (int)
- log_probabilities (LogProbabilities | None)

Return type:

None
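
Hypothetical usage, assuming ctx is an existing TextAndVisionContext mid-generation:

```python
# Append one newly sampled token; per the docstring, the vision state
# is re-validated as part of the update.
ctx.update(new_token=63)
```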

vision_token_ids

vision_token_ids: list[int]

The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.
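
In the diagram at the top of this page, vision_token_ids would be [98]; a Pixtral-style model would carry a second entry. A sketch (not the library's implementation) of how such ids can locate placeholder positions in a token array:

```python
import numpy as np

token_ids = np.array(
    [51, 52, 53, 54, 97, 98, 98, 98, 98, 99,
     55, 56, 57, 58, 97, 98, 98, 98, 98, 99,
     59, 60, 61, 62]
)
vision_token_ids = [98]

placeholder_positions = np.flatnonzero(np.isin(token_ids, vision_token_ids))
print(placeholder_positions)  # [ 5  6  7  8 15 16 17 18]
```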