Python class

TextAndVisionContext

`TextAndVisionContext`

class max.pipelines.context.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, vocab_size=None, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, grammar=None, grammar_state=<factory>, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _is_padding_ctx=False, _draft_offset=0, _spec_decoding_state=None, in_reasoning_phase=False, target_endpoint=None, external_block_metadata=None, cache_salt=None, dkv_hint_instance_name='', cached_prefix_length=None, _cache_metrics_emitted=False, _pending_future_count=0, vision_token_ids, images=<factory>, token_hash_overrides=<factory>, extra_model_args=<factory>)

Bases: TextContext

A base class for model contexts, specifically for vision model variants.

For example:

- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

          idx: [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
    token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                                ^-- img0 -^                   ^-- img1 -^
                                                  ^ start_idx=11 (image_idx=1)

Then we would have:

- ImageMetadata(start_idx=5, end_idx=9, ...)  # img0
- ImageMetadata(start_idx=15, end_idx=19, ...)  # img1

These image ranges should be non-overlapping.
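
As a rough illustration of how these ranges relate to the token array, the sketch below recovers half-open `(start_idx, end_idx)` spans by scanning for runs of the vision token. The helper `find_image_spans` is hypothetical and not part of the MAX API; it only assumes, consistent with the example above, that `end_idx` is exclusive.

```python
from __future__ import annotations


def find_image_spans(token_ids: list[int], vision_token_id: int) -> list[tuple[int, int]]:
    """Return half-open (start_idx, end_idx) spans, one per run of vision tokens."""
    spans: list[tuple[int, int]] = []
    run_start: int | None = None
    for i, tok in enumerate(token_ids):
        if tok == vision_token_id:
            if run_start is None:
                run_start = i  # a new run of placeholder tokens begins here
        elif run_start is not None:
            spans.append((run_start, i))  # the run ended at i (exclusive)
            run_start = None
    if run_start is not None:
        spans.append((run_start, len(token_ids)))  # run reaches the end of the array
    return spans


token_ids = [51, 52, 53, 54, 97, 98, 98, 98, 98, 99, 55, 56, 57, 58,
             97, 98, 98, 98, 98, 99, 59, 60, 61, 62]
spans = find_image_spans(token_ids, vision_token_id=98)
print(spans)  # [(5, 9), (15, 19)]

# A left-to-right scan yields sorted spans, so non-overlap reduces to each
# span ending at or before the next one starts.
assert all(a_end <= b_start for (_, a_end), (b_start, _) in zip(spans, spans[1:]))
```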

The image_idx is determined by the value of start_idx: it is the index of the first image that has not yet been encoded. In the diagram above, for example, start_idx=11 implies image_idx=1.
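
One way to picture this lookup is a binary search over the image end positions. The sketch below is an assumption about the rule, not the library's implementation: it treats an image as encoded once start_idx has reached its exclusive end_idx, and represents each image as a plain `(start_idx, end_idx)` tuple standing in for ImageMetadata. The function name `image_idx_for` is hypothetical.

```python
from __future__ import annotations

from bisect import bisect_right


def image_idx_for(start_idx: int, images: list[tuple[int, int]]) -> int:
    """Return the index of the first image that start_idx has not yet passed."""
    # Assumption: an image counts as encoded once start_idx >= its end_idx.
    # The end positions are sorted as long as the images are listed in prompt
    # order and do not overlap.
    ends = [end for _, end in images]
    return bisect_right(ends, start_idx)


images = [(5, 9), (15, 19)]
print(image_idx_for(11, images))  # 1, matching the diagram
print(image_idx_for(0, images))   # 0, no image has been encoded yet
print(image_idx_for(19, images))  # 2, every image has been encoded
```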

When chunked prefill is not active, current_position must not fall in the middle of an image. This is verified in _validate_state, which is called before and after mutating methods such as _bump_token_indices. During chunked prefill the restriction is relaxed, because the vision encoder cache ensures that each image is encoded once and reused across chunks.
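
The restriction can be sketched as a simple range check. This is only an illustration under stated assumptions: `check_position` is a hypothetical name, images are plain `(start_idx, end_idx)` tuples with an exclusive end, and "in the middle" is read as strictly between the two bounds. The real _validate_state may check more than this.

```python
from __future__ import annotations


def check_position(
    position: int, images: list[tuple[int, int]], chunked_prefill: bool
) -> None:
    """Raise ValueError if position splits an image while chunked prefill is off."""
    if chunked_prefill:
        return  # restriction relaxed: images may be split across chunks
    for i, (start, end) in enumerate(images):
        # Assumption: sitting exactly on start or end does not split the image.
        if start < position < end:
            raise ValueError(
                f"position {position} lies inside image {i} [{start}, {end})"
            )


images = [(5, 9), (15, 19)]
check_position(11, images, chunked_prefill=False)  # accepted: between the images
check_position(7, images, chunked_prefill=True)    # accepted: restriction relaxed
try:
    check_position(7, images, chunked_prefill=False)
except ValueError as exc:
    print(exc)
```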

Parameters:

max_length (int)
tokens (TokenBuffer)
request_id (RequestID)
eos_tracker (EOSTracker)
vocab_size (int | None)
log_probabilities (int)
log_probabilities_echo (bool)
ignore_eos (bool)
json_schema (str | None)
grammar (str | None)
grammar_state (GrammarEnforcementState)
sampling_params (SamplingParams)
model_name (str)
_matcher (Any | None)
status (GenerationStatus)
_log_probabilities_data (dict[int, LogProbabilities])
_is_initial_prompt (bool)
_is_padding_ctx (bool)
_draft_offset (int)
_spec_decoding_state (SpecDecodingState | None)
in_reasoning_phase (bool)
target_endpoint (str | None)
external_block_metadata (Any)
cache_salt (str | None)
dkv_hint_instance_name (str)
cached_prefix_length (int | None)
_cache_metrics_emitted (bool)
_pending_future_count (int)
vision_token_ids (list[int])
images (list[ImageMetadata])
token_hash_overrides (list[TokenHashOverride])
extra_model_args (dict[str, ndarray[tuple[Any, ...], dtype[Any]]])

`compute_image_aligned_idx()`

compute_image_aligned_idx(idx)

Aligns an index value downward if it lies in the middle of an image; otherwise the index is returned unchanged.

Parameters:

idx (int)

Return type:

int
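
A minimal sketch of the alignment, assuming "downward" means snapping to the image's start position and that images are half-open `(start_idx, end_idx)` ranges. The standalone function `align_to_image_start` is hypothetical; the actual method reads the ranges from the context's images field.

```python
from __future__ import annotations


def align_to_image_start(idx: int, images: list[tuple[int, int]]) -> int:
    """Snap idx down to the start of the image it falls inside, if any."""
    for start, end in images:
        if start < idx < end:
            return start  # idx would split this image, so move before it
    return idx  # idx is outside every image (or on a boundary): keep it


images = [(5, 9), (15, 19)]
print(align_to_image_start(7, images))   # 5, pulled back to the start of img0
print(align_to_image_start(11, images))  # 11, between images so unchanged
print(align_to_image_start(18, images))  # 15, pulled back to the start of img1
```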

`extra_model_args`

extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]

Extra arguments for the vision model. These arguments are model-specific.

`image_idx`

property image_idx: int

Index of the next unencoded image in the prompt.

`image_token_indices`

property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]

Positions of image-placeholder tokens in the full token sequence.

Derived from images metadata. Subclasses that precompute indices at tokenization time (e.g. KimiK2.5, Qwen2.5VL) may override this with a stored field for efficiency.
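
To make "derived from images metadata" concrete, the sketch below expands each image range into its token positions and concatenates them into an int32 array. It is an approximation under assumptions: the function name is hypothetical, images are `(start_idx, end_idx)` tuples with an exclusive end, and every position in a range is treated as a placeholder token.

```python
from __future__ import annotations

import numpy as np


def image_token_indices_from(images: list[tuple[int, int]]) -> np.ndarray:
    """Concatenate the token positions covered by each image range."""
    if not images:
        # np.concatenate rejects an empty list, so handle this case directly.
        return np.empty(0, dtype=np.int32)
    return np.concatenate(
        [np.arange(start, end, dtype=np.int32) for start, end in images]
    )


indices = image_token_indices_from([(5, 9), (15, 19)])
print(indices.tolist())  # [5, 6, 7, 8, 15, 16, 17, 18]
print(indices.dtype)     # int32
```

A subclass that already knows these positions at tokenization time can skip this computation and store the array directly, which is the override the text above describes.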

`images`

images: list[ImageMetadata]

Metadata about each image in the prompt.

`needs_vision_encoding`

property needs_vision_encoding: bool

Returns whether vision encoding is needed for this context.

`next_images`

property next_images: list[ImageMetadata]

Returns the images that are not yet encoded.
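
The relationship between image_idx, next_images, and needs_vision_encoding can be sketched as follows. This is an assumption about how the properties fit together, not a copy of the implementation: the function names are hypothetical, and the sketch takes "needs encoding" to mean that at least one image is still unencoded.

```python
from __future__ import annotations


def next_images_from(images: list[tuple[int, int]], image_idx: int) -> list[tuple[int, int]]:
    """Return the images at or after image_idx, i.e. those not yet encoded."""
    return images[image_idx:]


def needs_vision_encoding_from(images: list[tuple[int, int]], image_idx: int) -> bool:
    """Assumption: encoding is needed while any image remains unencoded."""
    return image_idx < len(images)


images = [(5, 9), (15, 19)]
print(next_images_from(images, 1))            # [(15, 19)]
print(needs_vision_encoding_from(images, 1))  # True
print(needs_vision_encoding_from(images, 2))  # False
```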

`token_hash_overrides`

token_hash_overrides: list[TokenHashOverride]

Token-level content hashes to inject into prefix-cache block hashing.

`update()`

update(new_token, log_probabilities=None)

Updates the context with a new token and validates vision state.

Parameters:

new_token (int)
log_probabilities (LogProbabilities | None)

Return type:

None

`vision_token_ids`

vision_token_ids: list[int]

The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.
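
The sketch below shows why a list is convenient: with more than one placeholder ID, a membership test finds every image-related position in one pass. The helper name and the token IDs are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def placeholder_positions(token_ids: list[int], vision_token_ids: list[int]) -> list[int]:
    """Return the positions whose token is any of the vision token IDs."""
    ids = set(vision_token_ids)  # set membership keeps the scan linear
    return [i for i, tok in enumerate(token_ids) if tok in ids]


# Hypothetical IDs: 10 is the image token, 12 is the image-break token.
token_ids = [1, 10, 10, 12, 10, 10, 2]
print(placeholder_positions(token_ids, [10, 12]))  # [1, 2, 3, 4, 5]
print(placeholder_positions(token_ids, [10]))      # [1, 2, 4, 5]
```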
