TTSContext

class max.pipelines.TTSContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, in_reasoning_phase=False, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None, _cache_metrics_emitted=False, audio_prompt_tokens=<factory>, buffer_speech_tokens=None, audio_buffer=None, prev_samples_beyond_offset=0, streaming=False, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, decoded_index=0, _block_counter=0, _arrival_time=<factory>, audio_generation_status=GenerationStatus.ACTIVE)


Bases: TextContext

A context for Text-to-Speech (TTS) model inference.

This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.

Parameters:

audio_buffer

audio_buffer: ndarray[tuple[Any, ...], dtype[floating[Any]]] | None = None

audio_generation_status

audio_generation_status: GenerationStatus = GenerationStatus.ACTIVE

audio_prompt_tokens

audio_prompt_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]

block_counter

property block_counter: int

The number of speech token blocks generated.

buffer_speech_tokens

buffer_speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]] | None = None

decoded_index

decoded_index: int = 0

is_done

property is_done: bool

Whether audio generation has finished.

next_speech_tokens()

next_speech_tokens(audio_chunk_size=None, buffer=None)

Returns a chunk of the next unseen speech tokens.

Calling this function does not advance the index of the last seen token; the caller must set decoded_index itself after the chunk has been processed.

Parameters:

  • audio_chunk_size (int | None) – The number of speech tokens to return.
  • buffer (int | None) – The number of previous speech tokens to pass to the audio decoder on each generation step.

Returns:

A tuple of (chunk of speech tokens, buffer).

Return type:

tuple[ndarray[tuple[Any, …], dtype[integer[Any]]], int]
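The chunk-and-buffer contract above can be sketched as a standalone function. This is a hypothetical re-implementation for illustration, not the actual MAX source: the exact slicing and the returned buffer count are assumptions inferred from the parameter and return descriptions.

```python
import numpy as np

def next_speech_tokens_sketch(speech_tokens, decoded_index,
                              audio_chunk_size=None, buffer=None):
    """Assumed semantics of TTSContext.next_speech_tokens.

    Returns (chunk, actual_buffer): `chunk` holds up to `buffer`
    already-decoded tokens followed by at most `audio_chunk_size`
    unseen tokens; `actual_buffer` counts the repeated tokens.
    The caller advances `decoded_index` itself afterwards.
    """
    # Reach back up to `buffer` tokens before the last decoded one.
    start = decoded_index if buffer is None else max(0, decoded_index - buffer)
    # Stop after `audio_chunk_size` unseen tokens, or at the end.
    end = len(speech_tokens)
    if audio_chunk_size is not None:
        end = min(end, decoded_index + audio_chunk_size)
    return speech_tokens[start:end], decoded_index - start
```

Under this reading, a consumer would decode the chunk and then advance with `decoded_index += len(chunk) - actual_buffer`, which matches the note above that the index is not updated automatically.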

prev_samples_beyond_offset

prev_samples_beyond_offset: int = 0

speech_tokens

property speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]

The slice of generated speech tokens valid so far.

streaming

streaming: bool = False

update_speech_tokens()

update_speech_tokens(new_tokens)

Updates the buffer with new speech tokens.

Parameters:

new_tokens (ndarray[tuple[Any, ...], dtype[integer[Any]]])

Return type:

None
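The private fields in the constructor signature (_speech_tokens, _speech_token_end_idx, _speech_token_size=128) suggest that the speech-token buffer grows in fixed-size blocks, with speech_tokens exposing only the valid prefix. The following is an assumed sketch of that mechanism; the class name and growth policy here are illustrative, not the actual MAX implementation.

```python
import numpy as np

class SpeechTokenBuffer:
    """Hypothetical sketch of the growable buffer behind TTSContext."""

    def __init__(self, block_size=128):
        self.block_size = block_size
        self._tokens = np.empty(0, dtype=np.int64)  # backing storage
        self._end = 0                               # index of last valid token

    def update(self, new_tokens):
        """Append new speech tokens, growing storage in whole blocks."""
        needed = self._end + len(new_tokens)
        if needed > len(self._tokens):
            blocks = -(-needed // self.block_size)  # ceiling division
            grown = np.empty(blocks * self.block_size, dtype=np.int64)
            grown[: self._end] = self._tokens[: self._end]
            self._tokens = grown
        self._tokens[self._end : needed] = new_tokens
        self._end = needed

    @property
    def speech_tokens(self):
        """Only the slice of generated tokens valid so far."""
        return self._tokens[: self._end]
```

Block-granular growth keeps appends amortized cheap while letting speech_tokens stay a cheap view rather than a copy, which would fit the block_counter property exposed above.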