TTSContext
class max.pipelines.core.TTSContext(audio_prompt_tokens=<factory>, buffer_speech_tokens=None, audio_buffer=None, prev_samples_beyond_offset=0, streaming=False, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, decoded_index=0, _block_counter=0, _arrival_time=<factory>, audio_generation_status=GenerationStatus.ACTIVE, *, request_id=<factory>, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None)
A context for Text-to-Speech (TTS) model inference.
This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.
Parameters:
- audio_prompt_tokens (ndarray[Any, dtype[integer[Any]]]) – Array of input audio prompt tokens used for voice cloning
- buffer_speech_tokens (ndarray[Any, dtype[integer[Any]]] | None)
- audio_buffer (ndarray[Any, dtype[floating[Any]]] | None)
- prev_samples_beyond_offset (int)
- streaming (bool) – Whether the request is streaming the audio to the client
- _speech_token_size (int) – Size of the speech token buffer, defaults to SPEECH_TOKEN_audio_chunk_size
- _speech_token_end_idx (int) – Index marking the end of valid speech tokens
- _speech_tokens (ndarray[Any, dtype[integer[Any]]]) – Buffer containing the generated speech tokens
- decoded_index (int)
- _block_counter (int) – Counter tracking number of speech token blocks generated
- _arrival_time (float)
- audio_generation_status (GenerationStatus)
- request_id (str)
- max_length (int)
- tokens (ndarray[Any, dtype[integer[Any]]])
- eos_token_ids (set[int])
- eos_sequences (list[list[int]])
- log_probabilities (int)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _size (int)
- _start_idx (int)
- _active_idx (int)
- _end_idx (int)
- _completion_start_idx (int)
- _completion_end_idx (int)
- _prompt_len (int)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
- target_endpoint (str | None)
audio_buffer
audio_generation_status
audio_generation_status: GenerationStatus
audio_prompt_tokens
block_counter
property block_counter: int
buffer_speech_tokens
buffer_speech_tokens: ndarray[Any, dtype[integer[Any]]] | None
decoded_index
decoded_index: int
has_undecoded_speech_tokens()
has_undecoded_speech_tokens(exclude_last_n=0)
Checks whether there are undecoded speech tokens.
Parameters:

exclude_last_n (int) – Number of tokens to exclude from the end when checking for undecoded tokens. For example, if set to 1, the last token is not considered.

Returns:

True if there are undecoded speech tokens (excluding the last n tokens), False otherwise.

Return type:

bool
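The check reduces to an index comparison. The sketch below uses plain integers rather than the context's internal state; it is an illustration of the documented behavior, not the actual implementation:

```python
def has_undecoded_speech_tokens(
    speech_token_end_idx: int, decoded_index: int, exclude_last_n: int = 0
) -> bool:
    # Tokens in [decoded_index, speech_token_end_idx) are not yet decoded;
    # exclude_last_n trims tokens from the end before comparing.
    return decoded_index < speech_token_end_idx - exclude_last_n

assert has_undecoded_speech_tokens(10, 8)                        # 2 tokens remain
assert not has_undecoded_speech_tokens(10, 8, exclude_last_n=2)  # both excluded
```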
is_done
property is_done: bool
next_speech_tokens()
next_speech_tokens(audio_chunk_size=None, buffer=None)
Returns a chunk of the next unseen speech tokens.
Calling this function will not update the index of the last seen token. This must be done by setting decoded_index after the chunk is processed.
Returns:

A tuple of (chunk of speech tokens, buffer).
prev_samples_beyond_offset
prev_samples_beyond_offset: int
speech_tokens
streaming
streaming: bool
update_speech_tokens()
update_speech_tokens(new_tokens)
Updates the speech token buffer with new_tokens.
TextAndVisionContext
class max.pipelines.core.TextAndVisionContext(*, request_id=<factory>, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None, pixel_values=(), extra_model_args=<factory>, _needs_vision_encoding=True)
A base class for model context, specifically for Vision model variants.
Parameters:
- request_id (str)
- max_length (int)
- tokens (ndarray[Any, dtype[integer[Any]]])
- eos_token_ids (set[int])
- eos_sequences (list[list[int]])
- log_probabilities (int)
- log_probabilities_echo (bool)
- ignore_eos (bool)
- json_schema (str | None)
- sampling_params (SamplingParams)
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _size (int)
- _start_idx (int)
- _active_idx (int)
- _end_idx (int)
- _completion_start_idx (int)
- _completion_end_idx (int)
- _prompt_len (int)
- _log_probabilities_data (dict[int, LogProbabilities])
- _is_initial_prompt (bool)
- _draft_offset (int)
- target_endpoint (str | None)
- pixel_values (tuple[ndarray[Any, dtype[floating[Any]]], ...])
- extra_model_args (dict[str, ndarray[Any, dtype[Any]]])
- _needs_vision_encoding (bool)
extra_model_args
needs_vision_encoding
property needs_vision_encoding: bool
Gets whether vision encoding is needed for this context.
pixel_values
pixel_values: tuple[ndarray[Any, dtype[floating[Any]]], ...]
reset()
reset()
Resets the context’s state by combining all tokens into a new prompt.
Return type:

None
update()
update(new_token, log_probabilities=None)
Updates the next_tokens and extends existing tokens to include all generated tokens.
Parameters:

- new_token (int)
- log_probabilities (LogProbabilities | None)

Return type:

None
TextContext
class max.pipelines.core.TextContext(*, request_id=<factory>, max_length, tokens, eos_token_ids=<factory>, eos_sequences=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _size=-1, _start_idx=0, _active_idx=-1, _end_idx=-1, _completion_start_idx=-1, _completion_end_idx=-1, _prompt_len=-1, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, target_endpoint=None)
A base class for model context, specifically for Text model variants.
This class manages the state and processing of text generation, including token management, caching, and generation parameters.
Parameters:
- request_id (str) – A unique identifier for this sequence.
- max_length (int) – Maximum allowed length of the generated sequence
- tokens (ndarray[Any, dtype[integer[Any]]]) – NumPy array containing the token IDs
- eos_token_ids (set[int]) – Set of token IDs that indicate end of sequence
- eos_sequences (list[list[int]])
- log_probabilities (int) – Whether to return token log probabilities
- log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
- ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
- json_schema (str | None) – Optional JSON schema for structured output
- sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
- model_name (str)
- _matcher (Any | None)
- status (GenerationStatus)
- _size (int) – Current allocated size of token array
- _start_idx (int) – Start index of current generation window
- _active_idx (int) – Current position in token sequence
- _end_idx (int) – End index of valid tokens
- _completion_start_idx (int) – Start index of completion tokens
- _completion_end_idx (int) – End index of completion tokens
- _prompt_len (int) – Length of original prompt
- _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
- _is_initial_prompt (bool) – Whether this is the initial prompt encoding
- _draft_offset (int) – Offset for draft decoding
- target_endpoint (str | None) – Optional target endpoint identifier for routing requests
active_idx
property active_idx: int
active_length
property active_length: int
The number of tokens input this iteration. This is the prompt size during context encoding, and simply 1 (or more) during token generation.
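Assuming `active_length` is the span between `_start_idx` and `_active_idx` (an assumption consistent with `next_tokens` returning the tokens between those indices), the two regimes look like this with plain integers:

```python
# Illustrative index arithmetic; the names mirror the documented attributes,
# but this is a sketch, not the class internals.
start_idx, active_idx = 0, 7          # context encoding of a 7-token prompt
assert active_idx - start_idx == 7    # active_length == prompt size

start_idx, active_idx = 7, 8          # generating one new token
assert active_idx - start_idx == 1    # active_length == 1
```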
all_tokens
bump_token_indices()
bump_token_indices(start_idx=0, active_idx=0, end_idx=0)
Update the start_idx, active_idx and end_idx without manipulating the token array.
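The distinction from set_token_indices() is relative versus absolute updates; a plain-integer sketch (illustrative, not the implementation):

```python
# bump_token_indices() adds deltas to the current indices.
start_idx, active_idx, end_idx = 0, 7, 7

# Bump by (0, 1, 1), as after accepting one generated token:
start_idx, active_idx, end_idx = start_idx + 0, active_idx + 1, end_idx + 1
assert (start_idx, active_idx, end_idx) == (0, 8, 8)

# set_token_indices(start_idx=8) would assign an absolute value instead:
start_idx = 8
assert start_idx == 8
```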
compute_num_available_steps()
compute_num_available_steps(max_seq_len)
Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.
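The headroom arithmetic is presumably the simple difference below; this is a hedged sketch, and the real method may fold in additional limits:

```python
def compute_num_available_steps(current_length: int, max_seq_len: int) -> int:
    # Each generation step appends one token, so the number of steps we can
    # take is the remaining room under the sequence-length cap.
    return max_seq_len - current_length

assert compute_num_available_steps(100, 128) == 28
```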
current_length
property current_length: int
The current length of the sequence, including completed and active tokens.
end_idx
property end_idx: int
eos_sequences
eos_token_ids
generated_tokens
property generated_tokens: ndarray[Any, dtype[integer[Any]]]
Returns all tokens that have been generated after the prompt.
Returns:

Array of generated tokens from prompt_len to end_idx.

Return type:

np.ndarray
get_min_token_logit_mask()
get_min_token_logit_mask(num_steps)
Returns a set of indices for the tokens in the output that should be masked.
This is primarily used for the min_tokens setting, where we mask eos tokens in the logits to avoid generating them before we reach min_tokens.
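The masking it describes can be sketched as follows. The function name and shapes here are illustrative assumptions; only the technique (forcing EOS logits to negative infinity until min_tokens is reached) comes from the text above:

```python
import numpy as np

def mask_eos_logits(
    logits: np.ndarray, eos_token_ids: set[int], num_generated: int, min_tokens: int
) -> np.ndarray:
    # While fewer than min_tokens new tokens have been generated, force the
    # EOS logits to -inf so the sampler cannot select them.
    if num_generated < min_tokens:
        logits = logits.copy()
        logits[list(eos_token_ids)] = -np.inf
    return logits

logits = np.zeros(5)
masked = mask_eos_logits(logits, {2, 4}, num_generated=1, min_tokens=3)
assert np.isneginf(masked[2]) and np.isneginf(masked[4])
assert masked[0] == 0.0  # non-EOS logits are untouched
```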
ignore_eos
ignore_eos: bool
is_done
property is_done: bool
is_initial_prompt
property is_initial_prompt: bool
Returns true if the context has not been updated with tokens.
json_schema
jump_ahead()
jump_ahead(new_token)
Updates the token array, while ensuring the new token is returned to the user.
Parameters:

new_token (int)

Return type:

None
log_probabilities
log_probabilities: int
log_probabilities_echo
log_probabilities_echo: bool
matcher
property matcher: LLMatcher | None
max_length
max_length: int
min_tokens
property min_tokens: int
The minimum number of new tokens to generate.
model_name
model_name: str
needs_ce
property needs_ce: bool
Returns whether this context needs context encoding (CE).
CE mode indicates that the context has additional prompt tokens to encode.
Returns:

True if the context needs CE, False otherwise.

Return type:

bool
next_tokens
Returns the tokens between start_idx and active_idx.
Returns:

Array of tokens that have been generated but not yet processed.

Return type:

np.ndarray
prompt_tokens
Returns the original prompt tokens.
Returns:

Array of tokens from the initial prompt.

Return type:

np.ndarray
request_id
request_id: str
reset()
reset()
Resets the context’s state by combining all tokens into a new prompt.
Return type:

None
rollback()
rollback(idx)
Parameters:

idx (int)

Return type:

None
sampling_params
sampling_params: SamplingParams
set_draft_offset()
set_draft_offset(idx)
Sets the draft offset index used for speculative decoding.
Parameters:

idx (int) – The index to set as the draft offset.

Return type:

None
set_matcher()
set_matcher(matcher)
Parameters:

matcher (LLMatcher)

Return type:

None
set_token_indices()
set_token_indices(start_idx=None, active_idx=None, end_idx=None)
Set the token indices without manipulating the token array.
start_idx
property start_idx: int
status
status: GenerationStatus
target_endpoint
to_generation_output()
to_generation_output()
Get completion tokens that are ready to be returned to the user.
This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.
Returns:

The completion tokens and their associated log probabilities, if available.
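Using the documented completion indices, the bookkeeping can be sketched as below. Treating delivery as a slice from `_completion_start_idx` to `_completion_end_idx`, with the start index advanced afterwards, is a simplified assumption, not the real method body:

```python
# Tokens in [completion_start_idx, completion_end_idx) are ready for
# delivery; advancing the start index ensures nothing is returned twice.
tokens = list(range(20))
completion_start_idx, completion_end_idx = 12, 17

ready = tokens[completion_start_idx:completion_end_idx]
completion_start_idx = completion_end_idx  # mark as delivered

assert ready == [12, 13, 14, 15, 16]
assert completion_start_idx == 17
```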
tokens
update()
update(new_token, log_probabilities=None)
Updates the next_tokens and extends existing tokens to include all generated tokens.
Parameters:

- new_token (int)
- log_probabilities (LogProbabilities | None)

Return type:

None