Python module

core

PixelContext

class max.pipelines.core.PixelContext(*, tokens, request_id=<factory>, model_name='', mask=None, tokens_2=None, negative_tokens=None, negative_tokens_2=None, extra_params=<factory>, timesteps=<factory>, sigmas=<factory>, latents=<factory>, latent_image_ids=<factory>, height=1024, width=1024, num_inference_steps=50, guidance_scale=3.5, guidance=None, true_cfg_scale=1.0, num_warmup_steps=0, num_images_per_prompt=1, status=GenerationStatus.ACTIVE)

A model-ready context for image/video generation requests.

This class contains only the numeric data that the model will execute against. User-facing strings (prompt, negative_prompt) are consumed during tokenization and do not appear here.

All preprocessing is performed by PixelGenerationTokenizer.new_context():

  • Prompt tokenization -> tokens field
  • Negative prompt tokenization -> negative_tokens field
  • Timestep schedule computation -> timesteps field
  • Initial noise generation -> latents field
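
For context, the sketch below illustrates, with plain NumPy, the kind of schedule and noise data these steps produce for the timesteps, sigmas, and latents fields. The linear schedule, timestep scaling, latent channel count, and VAE downsampling factor are all assumptions made for illustration; new_context() uses model-specific values.

    import numpy as np

    num_inference_steps = 50
    height, width = 1024, 1024

    # Illustrative linear sigma schedule from 1.0 down to 0.0.
    sigmas = np.linspace(1.0, 0.0, num_inference_steps, dtype=np.float32)

    # Timesteps derived from the sigmas (the 1000x scaling is an assumption).
    timesteps = sigmas * 1000.0

    # Initial Gaussian noise in latent space; 16 latent channels and an 8x
    # VAE downsampling factor are assumed purely for illustration.
    latents = np.random.randn(1, 16, height // 8, width // 8).astype(np.float32)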

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute number of available steps for scheduler compatibility.

For image and video generation, this returns the number of inference steps.

Parameters:

max_seq_len (int)

Return type:

int

extra_params

extra_params: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]

Model-specific numeric parameters (e.g., cfg_normalization values).

guidance

guidance: ndarray[tuple[Any, ...], dtype[float32]] | None = None

guidance_scale

guidance_scale: float = 3.5

height

height: int = 1024

is_done

property is_done: bool

Whether the request has completed generation.

latent_image_ids

latent_image_ids: ndarray[tuple[Any, ...], dtype[float32]]

Precomputed latent image IDs for generation.

latents

latents: ndarray[tuple[Any, ...], dtype[float32]]

Precomputed initial noise (latents) for generation.

mask

mask: ndarray[tuple[Any, ...], dtype[bool]] | None = None

Mask for text encoder’s attention.

model_name

model_name: str = ''

negative_tokens

negative_tokens: TokenBuffer | None = None

Negative tokens for primary encoder.

negative_tokens_2

negative_tokens_2: TokenBuffer | None = None

Negative tokens for secondary encoder. None for single-encoder models.

num_images_per_prompt

num_images_per_prompt: int = 1

num_inference_steps

num_inference_steps: int = 50

num_warmup_steps

num_warmup_steps: int = 0

request_id

request_id: RequestID

reset()

reset()

Resets the context’s state.

Return type:

None

sigmas

sigmas: ndarray[tuple[Any, ...], dtype[float32]]

Precomputed sigmas schedule for denoising.

status

status: GenerationStatus = 'active'

timesteps

timesteps: ndarray[tuple[Any, ...], dtype[float32]]

Precomputed timesteps schedule for denoising.

to_generation_output()

to_generation_output()

Convert this context to a GenerationOutput object.

Return type:

GenerationOutput

tokens

tokens: TokenBuffer

Primary encoder tokens.

tokens_2

tokens_2: TokenBuffer | None = None

Secondary encoder tokens. None for single-encoder models.

true_cfg_scale

true_cfg_scale: float = 1.0

update()

update(latents)

Update the context with newly generated latents/image data.

Parameters:

latents (ndarray[tuple[Any, ...], dtype[Any]])

Return type:

None
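
As a rough usage sketch, a denoising loop feeds each step's newly generated latents back through update() and stops once is_done reports completion. Here ctx is a PixelContext (for example, one produced by PixelGenerationTokenizer.new_context()) and denoise_step is a placeholder for the model's actual denoising call, not part of this API.

    # Illustrative denoising loop; denoise_step is a placeholder.
    latents = ctx.latents
    for t in ctx.timesteps:
        latents = denoise_step(latents, t, guidance_scale=ctx.guidance_scale)
        ctx.update(latents)
        if ctx.is_done:
            break

    output = ctx.to_generation_output()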

width

width: int = 1024

TTSContext

class max.pipelines.core.TTSContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None, audio_prompt_tokens=<factory>, buffer_speech_tokens=None, audio_buffer=None, prev_samples_beyond_offset=0, streaming=False, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, decoded_index=0, _block_counter=0, _arrival_time=<factory>, audio_generation_status=GenerationStatus.ACTIVE)

A context for Text-to-Speech (TTS) model inference.

This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.

audio_buffer

audio_buffer: ndarray[tuple[Any, ...], dtype[floating[Any]]] | None = None

audio_generation_status

audio_generation_status: GenerationStatus = 'active'

audio_prompt_tokens

audio_prompt_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]

block_counter

property block_counter: int

buffer_speech_tokens

buffer_speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]] | None = None

decoded_index

decoded_index: int = 0

is_done

property is_done: bool

next_speech_tokens()

next_speech_tokens(audio_chunk_size=None, buffer=None)

Returns a chunk of the next unseen speech tokens.

Calling this function will not update the index of the last seen token. This must be done by setting decoded_index after the chunk is processed.

Parameters:

  • audio_chunk_size (int | None) – The number of speech tokens to return.
  • buffer (int | None) – The number of previous speech tokens to pass to the audio decoder on each generation step.

Returns:

A tuple of (chunk of speech tokens, buffer).

Return type:

tuple[ndarray[tuple[Any, …], dtype[integer[Any]]], int]
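
A hedged sketch of the intended read protocol: fetch a chunk, hand it to the audio decoder, then advance decoded_index yourself. decode_audio is a placeholder for the downstream decoder, and advancing decoded_index by the chunk length assumes that, with no buffer requested, the returned chunk contains only unseen tokens.

    # Illustrative streaming read loop for a TTSContext `ctx`.
    while ctx.decoded_index < ctx.speech_tokens.shape[-1]:
        chunk, _buffer = ctx.next_speech_tokens(audio_chunk_size=64)
        if chunk.size == 0:
            break
        decode_audio(chunk)  # placeholder for the audio decoder
        # next_speech_tokens() does not advance the read position itself.
        ctx.decoded_index += chunk.shape[-1]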

prev_samples_beyond_offset

prev_samples_beyond_offset: int = 0

speech_tokens

property speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]

streaming

streaming: bool = False

update_speech_tokens()

update_speech_tokens(new_tokens)

Updates the context with newly generated speech tokens.

Parameters:

new_tokens (ndarray[tuple[Any, ...], dtype[integer[Any]]])

Return type:

None

TextAndVisionContext

class max.pipelines.core.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None, vision_token_ids, images=<factory>, extra_model_args=<factory>)

A base class for model context, specifically for Vision model variants.

For example:

- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

-       idx: [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
- token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                              ^-- img0 --^                  ^-- img1 --^
                                                 ^ start_idx=11 (image_idx=1)

Then we would have:

- ImageMetadata(start_idx=5, end_idx=9, ...)  # img0
- ImageMetadata(start_idx=15, end_idx=19, ...)  # img1

These image ranges should be non-overlapping.

The image_idx is determined by the value of start_idx: it is the index of the first image that is not yet encoded. For example, in the diagram above, start_idx=11 implies image_idx=1.

Currently, start_idx and current_position are restricted from falling in the middle of an image. This is verified by the _validate_state methods, which are called before and after mutating methods such as _bump_token_indices.
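
The rule can be illustrated with a small standalone sketch using the ranges above; this is not the class's implementation, only the relationship the paragraphs above describe.

    # Illustrative only: (start_idx, end_idx) ranges for img0 and img1 above.
    image_ranges = [(5, 9), (15, 19)]

    def image_idx_for(start_idx: int) -> int:
        # Index of the first image that has not yet been fully consumed.
        for i, (_, end) in enumerate(image_ranges):
            if start_idx < end:
                return i
        return len(image_ranges)

    assert image_idx_for(0) == 0   # nothing encoded yet -> img0 is next
    assert image_idx_for(11) == 1  # matches the diagram: start_idx=11 -> image_idx=1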

compute_image_aligned_idx()

compute_image_aligned_idx(idx)

Possibly aligns an index value downward if it lies in the middle of an image.

Parameters:

idx (int)

Return type:

int

extra_model_args

extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]

Model-specific extra arguments for the vision model.

image_idx

property image_idx: int

Index of the next unencoded image in the prompt.

images

images: list[ImageMetadata]

Metadata about each image in the prompt.

needs_vision_encoding

property needs_vision_encoding: bool

Returns whether vision encoding is needed for this context.

next_images

property next_images: list[ImageMetadata]

Returns the images that are not yet encoded.

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int)
  • log_probabilities (LogProbabilities | None)

Return type:

None

vision_token_ids

vision_token_ids: list[int]

The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.

TextContext

class max.pipelines.core.TextContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None)

A base class for model context, specifically for Text model variants.

This class manages the state and processing of text generation, including token management, caching, and generation parameters.

Parameters:

  • max_length (int) – Maximum allowed length of the generated sequence
  • tokens (TokenBuffer) – NumPy array containing the token IDs
  • request_id (RequestID) – A unique identifier for this sequence.
  • eos_token_ids (set[int]) – Set of token IDs that indicate end of sequence
  • eos_sequences (list[list[int]])
  • log_probabilities (int) – Whether to return token log probabilities
  • log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
  • ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
  • json_schema (str | None) – Optional JSON schema for structured output
  • sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
  • model_name (str)
  • _matcher (Any | None)
  • status (GenerationStatus)
  • _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
  • _is_initial_prompt (bool) – Whether this is the initial prompt encoding
  • _draft_offset (int) – Offset for draft decoding
  • target_endpoint (str | None) – Optional target endpoint identifier for routing requests
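
A minimal construction sketch based on the parameters above: only max_length and tokens are required by the signature. Passing a plain NumPy array of token IDs where a TokenBuffer is expected is an assumption made for illustration and may not hold exactly in practice.

    import numpy as np

    from max.pipelines.core import TextContext

    ctx = TextContext(
        max_length=512,
        tokens=np.array([1, 15043, 29892, 3186], dtype=np.int64),  # example prompt token IDs
    )
    print(ctx.is_initial_prompt)  # True until the context is updated with generated tokens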

apply_processing_offset()

apply_processing_offset(offset)

Parameters:

offset (int)

Return type:

None

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

Parameters:

max_seq_len (int)

Return type:

int
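
For example, a scheduler can clamp the number of decode steps it plans to run against this bound; the names below are illustrative and ctx is assumed to be an existing TextContext.

    requested_steps = 16
    num_steps = min(requested_steps, ctx.compute_num_available_steps(max_seq_len=4096))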

eos_sequences

eos_sequences: list[list[int]]

eos_token_ids

eos_token_ids: set[int]

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns a set of indices for the tokens in the output that should be masked.

This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.

Returns:

A set of indices for the tokens in the output that should be masked.

Parameters:

num_steps (int)

Return type:

list[ndarray[tuple[Any, …], dtype[int32]]]

ignore_eos

ignore_eos: bool = False

is_done

property is_done: bool

is_initial_prompt

property is_initial_prompt: bool

Returns true if the context has not been updated with tokens.

json_schema

json_schema: str | None = None

jump_ahead()

jump_ahead(new_token)

Updates the token array, while ensuring the new token is returned to the user.

Parameters:

new_token (int)

Return type:

None

log_probabilities

log_probabilities: int = 0

log_probabilities_echo

log_probabilities_echo: bool = False

matcher

property matcher: LLMatcher | None

max_length

max_length: int

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

model_name

model_name: str = ''

realize_future_token()

realize_future_token(new_token, log_probabilities=None)

Overwrite the placeholder future token with the actual token.

This is primarily used for overlap scheduling.

Parameters:

  • new_token (int)
  • log_probabilities (LogProbabilities | None)

Return type:

None

request_id

request_id: RequestID

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt.

Return type:

None

sampling_params

sampling_params: SamplingParams

set_matcher()

set_matcher(matcher)

Parameters:

matcher (LLMatcher)

Return type:

None

status

status: GenerationStatus = 'active'

target_endpoint

target_endpoint: str | None = None

to_generation_output()

to_generation_output()

Get completion tokens that are ready to be returned to the user.

This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.

Returns:

The completion tokens and their associated log probabilities, if available.

Return type:

TextGenerationOutput

tokens

tokens: TokenBuffer

update()

update(new_token, log_probabilities=None)

Updates the next_tokens and extends existing tokens to include all generated tokens.

Parameters:

  • new_token (int)
  • log_probabilities (LogProbabilities | None)

Return type:

None
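
Together with is_done and to_generation_output(), update() supports a simple token-by-token decode loop. sample_next_token below is a placeholder for the model's forward and sampling step, not part of this API.

    # Illustrative decode loop for a TextContext `ctx`.
    while not ctx.is_done:
        new_token = sample_next_token(ctx)  # placeholder: model forward + sampling
        ctx.update(new_token)

    result = ctx.to_generation_output()  # completion tokens ready for the user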

update_with_future_token()

update_with_future_token()

Append a placeholder future token to the generated tokens.

This is primarily used for overlap scheduling.

Return type:

None
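
update_with_future_token() and realize_future_token() form a small handshake for overlap scheduling: a placeholder token is appended while the next decode step is still in flight, then overwritten once the real token is known. The sketch below is illustrative and await_inflight_token is a placeholder name.

    ctx.update_with_future_token()   # append a placeholder token
    token = await_inflight_token()   # placeholder: result of the in-flight decode step
    ctx.realize_future_token(token)  # overwrite the placeholder with the real token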

reserve_token_space_for_batch()

max.pipelines.core.reserve_token_space_for_batch(batch, num_tokens)

Temporarily reserves token space for each context in a batch by incrementing the _active_idx and _end_idx attributes by num_tokens for the duration of the context manager. These indices are restored to their original values upon exit.

Parameters:

  • batch (list[TextContext]) – List of TextContext objects to reserve space for.
  • num_tokens (int) – Number of tokens to reserve for each context.

Yields:

None

Return type:

Iterator[None]
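
Because this function yields, it is used as a context manager. In the sketch below, batch is assumed to be a list of TextContext objects and process_batch is a placeholder for whatever work needs the reserved space.

    from max.pipelines.core import reserve_token_space_for_batch

    with reserve_token_space_for_batch(batch, num_tokens=8):
        # Each context in `batch` now has room reserved for 8 extra tokens.
        process_batch(batch)  # placeholder
    # On exit, the contexts' indices are restored to their original values.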

validate_aspect_ratio_args()

max.pipelines.core.validate_aspect_ratio_args(context)

Validates that required aspect ratio arguments are present for vision input.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If required aspect ratio arguments are missing.

Return type:

None
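
This validator, like the others that follow, takes a context, raises InputError when a requirement is not met, and otherwise returns None. The import location of InputError below is an assumption made for illustration; context is assumed to be an existing TextAndVisionContext.

    from max.pipelines.core import validate_aspect_ratio_args

    # Assumption: InputError's import path may differ.
    from max.interfaces import InputError

    try:
        validate_aspect_ratio_args(context)
    except InputError:
        # Reject the request or surface the error to the client.
        raise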

validate_image_grid_thw_args()

max.pipelines.core.validate_image_grid_thw_args(context)

Validates that image_grid_thw is present when vision encoding is needed.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If image_grid_thw is missing from extra_model_args when vision encoding is needed.

Return type:

None

validate_image_shape_5d()

max.pipelines.core.validate_image_shape_5d(context)

Validates that images have the expected 5-dimensional shape.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If the image shape is not 5-dimensional.

Return type:

None

validate_initial_prompt_has_image()

max.pipelines.core.validate_initial_prompt_has_image(context)

Validates that initial prompts contain an image for vision models.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If the initial prompt doesn’t contain an image.

Return type:

None

validate_only_one_image()

max.pipelines.core.validate_only_one_image(context)

Validates that at most one image is provided in the context.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If more than one image is provided.

Return type:

None

validate_requires_vision_context()

max.pipelines.core.validate_requires_vision_context(context)

Validates that the context is a TextAndVisionContext.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If the context is not a TextAndVisionContext.

Return type:

None

validate_vision_position_ids()

max.pipelines.core.validate_vision_position_ids(context)

Validates that vision_position_ids is present when vision encoding is needed.

Parameters:

context (TextContext | TextAndVisionContext) – The context to validate.

Raises:

InputError – If vision_position_ids is missing from extra_model_args when vision encoding is needed.

Return type:

None
