Python module
core
TTSContext
class max.pipelines.core.TTSContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None, audio_prompt_tokens=<factory>, buffer_speech_tokens=None, audio_buffer=None, prev_samples_beyond_offset=0, streaming=False, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, decoded_index=0, _block_counter=0, _arrival_time=<factory>, audio_generation_status=GenerationStatus.ACTIVE)
A context for Text-to-Speech (TTS) model inference.
This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.
-
Parameters:
-
- max_length (int)
- tokens (TokenBuffer)
- request_id (RequestID)
- eos_token_ids (set[int])
- eos_sequences (list[list[int]])
- log_probabilities (int)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
- target_endpoint (str | None)
- audio_prompt_tokens (ndarray[tuple[Any, ...], dtype[integer[Any]]]) – Array of input audio prompt tokens used for voice cloning
- buffer_speech_tokens (ndarray[tuple[Any, ...], dtype[integer[Any]]] | None)
- audio_buffer (ndarray[tuple[Any, ...], dtype[floating[Any]]] | None)
- prev_samples_beyond_offset (int)
- streaming (bool) – Whether the request streams audio to the client
- _speech_token_size (int) – Size of the speech token buffer; defaults to 128
- _speech_token_end_idx (int) – Index marking the end of valid speech tokens
- _speech_tokens (ndarray[tuple[Any, ...], dtype[integer[Any]]]) – Buffer containing the generated speech tokens
- decoded_index (int)
- _block_counter (int) – Counter tracking number of speech token blocks generated
- _arrival_time (float)
- audio_generation_status (GenerationStatus)
audio_buffer
audio_buffer: ndarray[tuple[Any, ...], dtype[floating[Any]]] | None = None
audio_generation_status
audio_generation_status: GenerationStatus = 'active'
audio_prompt_tokens
audio_prompt_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]
block_counter
property block_counter: int
buffer_speech_tokens
buffer_speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]] | None = None
decoded_index
decoded_index: int = 0
is_done
property is_done: bool
next_speech_tokens()
next_speech_tokens(audio_chunk_size=None, buffer=None)
Returns a chunk of the next unseen speech tokens.
Calling this function will not update the index of the last seen token. This must be done by setting decoded_index after the chunk is processed.
-
Parameters:
-
- audio_chunk_size (int | None)
- buffer (int | None)
Returns:
-
A tuple of (chunk of speech tokens, buffer).
-
Return type:
-
tuple
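Because next_speech_tokens does not advance the read index itself, a typical consumer fetches a chunk, processes it, and then bumps decoded_index. A minimal, hypothetical sketch (ctx is an existing TTSContext; vocode is a stand-in for whatever consumes the tokens):

    chunk, buffer = ctx.next_speech_tokens(audio_chunk_size=256)
    if len(chunk) > 0:
        vocode(chunk)  # hypothetical downstream audio decoder
        # next_speech_tokens does not mark tokens as seen; do it explicitly.
        # Assumes the chunk starts at the current decoded_index.
        ctx.decoded_index += len(chunk)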
prev_samples_beyond_offset
prev_samples_beyond_offset: int = 0
speech_tokens
property speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]
streaming
streaming: bool = False
update_speech_tokens()
update_speech_tokens(new_tokens)
Appends new_tokens to the speech token buffer.
TextAndVisionContext
class max.pipelines.core.TextAndVisionContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None, vision_token_ids, images=<factory>, extra_model_args=<factory>)
A base class for model context, specifically for Vision model variants.
For example:
- <vision_start_token_id> = 97
- <vision_token_id> = 98
- <vision_end_token_id> = 99

Token array:

idx:       [  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
                         ^---- img0 -----^             ^---- img1 -----^
                                             ^ start_idx=11 (image_idx=1)

Then we would have:
- ImageMetadata(start_idx=5, end_idx=9, ...)   # img0
- ImageMetadata(start_idx=15, end_idx=19, ...) # img1

These image ranges should be non-overlapping.
The image_idx is determined by the value of start_idx: it is the index of the first image that is not yet encoded. For example, in the diagram above, start_idx=11 implies image_idx=1.
Currently, start_idx and current_position are restricted from falling in the middle of an image. This is verified by the _validate_state methods, which are called before and after mutating methods such as _bump_token_indices.
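Following the convention above, here is a hypothetical sketch of how image_idx can be derived from start_idx (images is the list of ImageMetadata, assumed sorted by start_idx):

    def image_idx_for(start_idx: int, images: list) -> int:
        # The first image whose range has not been passed is the first
        # image not yet encoded; e.g. start_idx=11 skips img0 (end_idx=9)
        # and returns 1 for img1 (end_idx=19).
        for i, img in enumerate(images):
            if start_idx <= img.end_idx:
                return i
        return len(images)  # every image is already encoded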
-
Parameters:
-
- max_length (int)
- tokens (TokenBuffer)
- request_id (RequestID)
- eos_token_ids (set[int])
- eos_sequences (list[list[int]])
- log_probabilities (int)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
- target_endpoint (str | None)
- vision_token_ids (list[int])
- images (list[ImageMetadata])
- extra_model_args (dict[str, ndarray[tuple[Any, ...], dtype[Any]]])
compute_image_aligned_idx()
compute_image_aligned_idx(idx)
Possibly aligns an index value downward if it lies in the middle of an image.
extra_model_args
extra_model_args: dict[str, ndarray[tuple[Any, ...], dtype[Any]]]
Extra model arguments for the vision model. These are model-specific arguments.
image_idx
property image_idx: int
Index of the next unencoded image in the prompt.
images
images: list[ImageMetadata]
Metadata about each image in the prompt.
needs_vision_encoding
property needs_vision_encoding: bool
Returns whether vision encoding is needed for this context.
next_images
property next_images: list[ImageMetadata]
Returns the images that are not yet encoded.
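Together, these two properties support an encode-on-demand pattern; a hypothetical sketch (ctx is a TextAndVisionContext, vision_encoder stands in for the model's image encoder):

    if ctx.needs_vision_encoding:
        for img in ctx.next_images:
            image_embeddings = vision_encoder(img)  # hypothetical encoder call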
update()
update(new_token, log_probabilities=None)
Updates next_tokens and extends the existing token array to include all generated tokens.
-
Parameters:
-
- new_token (int)
- log_probabilities (LogProbabilities | None)
-
Return type:
-
None
vision_token_ids
vision_token_ids: list[int]
The value of the <vision_token_id> special token. This is a list primarily because of Pixtral, which also has an image_break_token_id.
TextContext
class max.pipelines.core.TextContext(*, max_length, tokens, request_id=<factory>, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None)
A base class for model context, specifically for Text model variants.
This class manages the state and processing of text generation, including token management, caching, and generation parameters.
-
Parameters:
-
- max_length (int) – Maximum allowed length of the generated sequence
- tokens (TokenBuffer) – NumPy array containing the token IDs
- request_id (RequestID) – A unique identifier for this sequence.
- eos_token_ids (set[int]) – Set of token IDs that indicate end of sequence
- eos_sequences (list[list[int]])
- log_probabilities (int) – Number of top token log probabilities to return (0 disables log probabilities)
- log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
- ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
- json_schema (str | None) – Optional JSON schema for structured output
- sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
- _is_initial_prompt (bool) – Whether this is the initial prompt encoding
- _draft_offset (int) – Offset for draft decoding
- target_endpoint (str | None) – Optional target endpoint identifier for routing requests
apply_processing_offset()
apply_processing_offset(offset)
-
Parameters:
-
offset (int)
-
Return type:
-
None
compute_num_available_steps()
compute_num_available_steps(max_seq_len)
Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.
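For example, a scheduler might clamp its requested decode steps to what still fits; a hypothetical sketch:

    # Never schedule more steps than the context can hold.
    steps = min(requested_steps, ctx.compute_num_available_steps(max_seq_len=4096))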
eos_sequences
eos_sequences: list[list[int]]
eos_token_ids
eos_token_ids: set[int]
get_min_token_logit_mask()
get_min_token_logit_mask(num_steps)
Returns a set of indices for the tokens in the output that should be masked.
This is primarily used for the min_tokens setting, where we mask EOS tokens in the logits to avoid generating them before reaching min_tokens.
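A hypothetical sketch of applying the returned indices (assumes logits is indexable by the indices this method returns):

    masked = ctx.get_min_token_logit_mask(num_steps=1)
    for idx in masked:
        logits[idx] = float("-inf")  # EOS cannot be sampled before min_tokens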
ignore_eos
ignore_eos: bool = False
is_done
property is_done: bool
is_initial_prompt
property is_initial_prompt: bool
Returns true if the context has not been updated with tokens.
json_schema
json_schema: str | None = None
jump_ahead()
jump_ahead(new_token)
Updates the token array, while ensuring the new token is returned to the user.
-
Parameters:
-
new_token (int)
-
Return type:
-
None
log_probabilities
log_probabilities: int = 0
log_probabilities_echo
log_probabilities_echo: bool = False
matcher
property matcher: LLMatcher | None
max_length
max_length: int
min_tokens
property min_tokens: int
The minimum number of new tokens to generate.
model_name
model_name: str = ''
request_id
request_id: RequestID
reset()
reset()
Resets the context’s state by combining all tokens into a new prompt.
-
Return type:
-
None
sampling_params
sampling_params: SamplingParams
set_matcher()
set_matcher(matcher)
-
Parameters:
-
matcher (LLMatcher)
-
Return type:
-
None
status
status: GenerationStatus = 'active'
target_endpoint
target_endpoint: str | None = None
to_generation_output()
to_generation_output()
Get completion tokens that are ready to be returned to the user.
This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.
-
Returns:
-
The completion tokens and their associated log probabilities, if available.
-
Return type:
tokens
tokens: TokenBuffer
update()
update(new_token, log_probabilities=None)
Updates next_tokens and extends the existing token array to include all generated tokens.
-
Parameters:
-
- new_token (int)
- log_probabilities (LogProbabilities | None)
-
Return type:
-
None
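Putting the pieces together, a decode loop might look like the following hypothetical sketch (sample_next_token stands in for the model's forward pass and sampler):

    while not ctx.is_done:
        new_token = sample_next_token(ctx)  # hypothetical model call
        ctx.update(new_token)
        # Stream back whatever is ready for the user so far.
        output = ctx.to_generation_output()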
reserve_token_space_for_batch()
max.pipelines.core.reserve_token_space_for_batch(batch, num_tokens)
Temporarily reserves token space for each context in a batch by incrementing the _active_idx and _end_idx attributes by num_tokens for the duration of the context manager. These indices are restored to their original values upon exit.
-
Yields:
-
None
-
Parameters:
-
- batch (list[TextContext]) – List of TextContext objects to reserve space for.
- num_tokens (int) – Number of tokens to reserve for each context.
-
Return type:
-
Iterator[None]
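Since this is a context manager, the reservation is scoped to a with block; a minimal sketch (run_speculative_step is a hypothetical stand-in for work done under the reservation):

    # Reserve room for 8 tokens per context; indices are restored on exit.
    with reserve_token_space_for_batch(batch, num_tokens=8):
        run_speculative_step(batch)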
validate_aspect_ratio_args()
max.pipelines.core.validate_aspect_ratio_args(context)
Validates that required aspect ratio arguments are present for vision input.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If required aspect ratio arguments are missing.
-
Return type:
-
None
validate_image_grid_thw_args()
max.pipelines.core.validate_image_grid_thw_args(context)
Validates that image_grid_thw is present when vision encoding is needed.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If image_grid_thw is missing from extra_model_args when vision encoding is needed.
-
Return type:
-
None
validate_image_shape_5d()
max.pipelines.core.validate_image_shape_5d(context)
Validates that images have the expected 5-dimensional shape.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If the image shape is not 5-dimensional.
-
Return type:
-
None
validate_initial_prompt_has_image()
max.pipelines.core.validate_initial_prompt_has_image(context)
Validates that initial prompts contain an image for vision models.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If the initial prompt doesn’t contain an image.
-
Return type:
-
None
validate_only_one_image()
max.pipelines.core.validate_only_one_image(context)
Validates that at most one image is provided in the context.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If more than one image is provided.
-
Return type:
-
None
validate_requires_vision_context()
max.pipelines.core.validate_requires_vision_context(context)
Validates that the context is a TextAndVisionContext.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If the context is not a TextAndVisionContext.
-
Return type:
-
None
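These validators share the same shape (take a context, raise InputError on failure), so a pipeline can compose the subset it needs; a hypothetical sketch:

    from max.pipelines.core import (
        validate_only_one_image,
        validate_requires_vision_context,
    )

    # Run pipeline-specific checks before scheduling the request.
    for check in (validate_requires_vision_context, validate_only_one_image):
        check(ctx)  # raises InputError if the context is invalid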
validate_vision_position_ids()
max.pipelines.core.validate_vision_position_ids(context)
Validates that vision_position_ids is present when vision encoding is needed.
-
Parameters:
-
context (TextContext | TextAndVisionContext) – The context to validate.
-
Raises:
-
InputError – If vision_position_ids is missing from extra_model_args when vision encoding is needed.
-
Return type:
-
None