Python class
TextAndVisionContext
class max.pipelines.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None, vision_token_ids, images=<factory>, extra_model_args=<factory>)
Bases: TextContext
A base class for model context, specifically for Vision model variants.
For example:

- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

```
idx:       [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                         ^----- img0 ----^             ^----- img1 ----^
                                              ^ start_idx=11 (image_idx=1)
```

Then we would have:

- ImageMetadata(start_idx=5, end_idx=9, ...)  # img0
- ImageMetadata(start_idx=15, end_idx=19, ...)  # img1

These image ranges should be non-overlapping.
The image_idx is determined by the value of start_idx: it is the index of the first image that has not yet been encoded. In the diagram above, start_idx=11 implies image_idx=1.
When chunk prefill is not active, we restrict current_position from being in the middle of an image. This is verified in _validate_state which is called before and after mutating methods like _bump_token_indices. During chunked prefill the restriction is relaxed because the vision encoder cache ensures images are encoded once and reused across chunks.
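The image_idx rule above can be sketched as follows. This uses a hypothetical, stripped-down ImageMetadata that keeps only the two index fields; the real class carries more metadata than shown here:

```python
from dataclasses import dataclass

@dataclass
class ImageMetadata:
    # Hypothetical simplification: the real ImageMetadata has more fields.
    start_idx: int  # position of the image's first placeholder token
    end_idx: int    # position of the image's closing boundary

def image_idx(images: list[ImageMetadata], start_idx: int) -> int:
    """Index of the first image that is not yet encoded, i.e. the
    first image starting at or after the context's start_idx."""
    for i, img in enumerate(images):
        if img.start_idx >= start_idx:
            return i
    return len(images)  # every image already encoded

# Ranges from the docstring example above.
images = [ImageMetadata(5, 9), ImageMetadata(15, 19)]
print(image_idx(images, 0))   # -> 0: nothing encoded yet
print(image_idx(images, 11))  # -> 1: img0 done, img1 pending
print(image_idx(images, 20))  # -> 2: all images encoded
```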
Parameters:
- max_length (int)
- tokens (TokenBuffer)
- request_id (RequestID)
- eos_tracker (EOSTracker)
- log_probabilities (int)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
- _spec_decoding_state (SpecDecodingState | None)
- target_endpoint (str | None)
- external_block_metadata (Any)
- cached_prefix_length (int | None)
- vision_token_ids (list[int])
- images (list[ImageMetadata])
- extra_model_args (dict[str, ndarray[tuple[Any, ...], dtype[Any]]])
compute_image_aligned_idx()
compute_image_aligned_idx(idx)
Possibly aligns an index value downward if it lies in the middle of an image.
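A plausible sketch of this alignment, representing images as simplified (start_idx, end_idx) tuples; the exact boundary semantics are an assumption for illustration:

```python
def compute_image_aligned_idx(images: list[tuple[int, int]], idx: int) -> int:
    """Snap idx down to an image's start when it falls strictly inside
    that image's token range, so an image is never split mid-range."""
    for start, end in images:
        if start < idx <= end:  # boundary semantics assumed, not confirmed
            return start
    return idx

# Ranges from the docstring example above.
images = [(5, 9), (15, 19)]
print(compute_image_aligned_idx(images, 7))   # -> 5: inside img0, align down
print(compute_image_aligned_idx(images, 5))   # -> 5: already aligned
print(compute_image_aligned_idx(images, 12))  # -> 12: between images, unchanged
```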
extra_model_args
extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]
Extra model arguments for the vision model. These are model specific arguments.
image_idx
property image_idx: int
Index of the next unencoded image in the prompt.
image_token_indices
property image_token_indices: ndarray[tuple[Any, ...], dtype[int32]]
Positions of image-placeholder tokens in the full token sequence.
Derived from images metadata. Subclasses that precompute indices
at tokenization time (e.g. KimiK2.5, Qwen2.5VL) may override this
with a stored field for efficiency.
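The derivation can be sketched as below, assuming (per the class docstring's example, where start_idx=5 points at the first <vision_token_id> and end_idx=9 at the closing boundary) that the half-open range [start_idx, end_idx) covers an image's placeholder tokens; that boundary convention is an assumption:

```python
import numpy as np

def image_token_indices(images: list[tuple[int, int]]) -> np.ndarray:
    """Concatenate every image's placeholder positions into one int32
    array, assuming half-open [start_idx, end_idx) ranges."""
    if not images:
        return np.empty(0, dtype=np.int32)
    return np.concatenate(
        [np.arange(start, end, dtype=np.int32) for start, end in images]
    )

# Ranges from the docstring example above.
print(image_token_indices([(5, 9), (15, 19)]))
# -> [ 5  6  7  8 15 16 17 18]
```

Precomputing this at tokenization time, as some subclasses do, trades a small amount of storage for avoiding the per-step concatenation.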
images
images: list[ImageMetadata]
Metadata about each image in the prompt.
needs_vision_encoding
property needs_vision_encoding: bool
Returns whether vision encoding is needed for this context.
next_images
property next_images: list[ImageMetadata]
Returns the images that are not yet encoded.
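This property and needs_vision_encoding can be sketched together (hypothetical simplified (start_idx, end_idx) tuples; the real implementation works on ImageMetadata objects):

```python
def next_images(images: list[tuple[int, int]], start_idx: int) -> list[tuple[int, int]]:
    """Images starting at or after the current position: still unencoded."""
    return [img for img in images if img[0] >= start_idx]

def needs_vision_encoding(images: list[tuple[int, int]], start_idx: int) -> bool:
    """Vision encoding is needed iff at least one image is unencoded."""
    return bool(next_images(images, start_idx))

# Ranges from the docstring example above.
images = [(5, 9), (15, 19)]
print(next_images(images, 11))            # -> [(15, 19)]: only img1 pending
print(needs_vision_encoding(images, 20))  # -> False: everything encoded
```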
update()
update(new_token, log_probabilities=None)
Updates the context with a new token and validates vision state.
Parameters:

- new_token (int)
- log_probabilities (LogProbabilities | None)

Return type:

None
vision_token_ids
The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.
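With more than one placeholder id (as described above for Pixtral-style models), locating vision tokens amounts to membership testing against the list. A sketch using NumPy, with hypothetical token-id values:

```python
import numpy as np

def vision_token_mask(token_ids, vision_token_ids):
    """Boolean mask over the sequence marking every position that
    holds any of the vision placeholder ids."""
    return np.isin(np.asarray(token_ids), vision_token_ids)

# Hypothetical ids: 98 as <vision_token_id>, 100 as an image-break id.
tokens = [51, 97, 98, 98, 100, 98, 99, 55]
mask = vision_token_mask(tokens, [98, 100])
print(np.nonzero(mask)[0])  # -> [2 3 4 5]
```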