Python class

TTSContext

class max.pipelines.TTSContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None, audio_prompt_tokens=<factory>, buffer_speech_tokens=None, audio_buffer=None, prev_samples_beyond_offset=0, streaming=False, _speech_token_size=128, _speech_token_end_idx=0, _speech_tokens=<factory>, decoded_index=0, _block_counter=0, _arrival_time=<factory>, audio_generation_status=GenerationStatus.ACTIVE)

Bases: TextContext

A context for Text-to-Speech (TTS) model inference.

This class extends TextContext to handle speech token generation and management. It maintains buffers for audio prompt tokens and generated speech tokens, along with tracking indices for decoding progress.

audio_buffer

audio_buffer: ndarray[tuple[Any, ...], dtype[floating[Any]]] | None = None

audio_generation_status

audio_generation_status: GenerationStatus = GenerationStatus.ACTIVE

audio_prompt_tokens

audio_prompt_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]

block_counter

property block_counter: int

The number of speech token blocks generated.

buffer_speech_tokens

buffer_speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]] | None = None

decoded_index

decoded_index: int = 0

is_done

property is_done: bool

Whether audio generation has finished.

next_speech_tokens()

next_speech_tokens(audio_chunk_size=None, buffer=None)

Returns a chunk of the next unseen speech tokens.

Calling this method does not advance the index of the last seen token. The caller must set decoded_index explicitly after the chunk has been processed.

Parameters:

  • audio_chunk_size (int | None) – The number of speech tokens to return.
  • buffer (int | None) – The number of previous speech tokens to pass to the audio decoder on each generation step.

Returns:

A tuple of (chunk of speech tokens, buffer).

Return type:

tuple[ndarray[tuple[Any, ...], dtype[integer[Any]]], int]
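
The read-then-advance pattern described above can be sketched with a small stand-in class. This is not the MAX implementation: `ToySpeechContext` and `drain` are hypothetical names, plain Python lists replace the NumPy arrays, and the exact handling of `audio_chunk_size` and `buffer` is an assumption made for illustration. The only behavior taken from the documentation is that `next_speech_tokens()` leaves `decoded_index` untouched and the caller advances it.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpeechContext:
    """Hypothetical stand-in for TTSContext; not the MAX implementation."""

    speech_tokens: list[int] = field(default_factory=list)
    decoded_index: int = 0

    def next_speech_tokens(self, audio_chunk_size=None, buffer=None):
        # Assumed semantics: start `buffer` tokens before decoded_index so the
        # decoder sees some already-decoded context, and report how many
        # context tokens were actually included.
        start = self.decoded_index
        if buffer is not None:
            start = max(0, start - buffer)
        end = len(self.speech_tokens)
        if audio_chunk_size is not None:
            end = min(end, self.decoded_index + audio_chunk_size)
        return self.speech_tokens[start:end], self.decoded_index - start


def drain(ctx, chunk_size, buffer):
    """Consume all unseen tokens in chunks, advancing decoded_index by hand."""
    new_token_batches = []
    while ctx.decoded_index < len(ctx.speech_tokens):
        chunk, used_buffer = ctx.next_speech_tokens(chunk_size, buffer)
        new_tokens = chunk[used_buffer:]  # drop the leading context tokens
        new_token_batches.append(new_tokens)
        # The call above did not move the index; advance it explicitly.
        ctx.decoded_index += len(new_tokens)
    return new_token_batches


ctx = ToySpeechContext(speech_tokens=list(range(10)))
batches = drain(ctx, chunk_size=4, buffer=2)
print(batches)
```

If the `ctx.decoded_index += ...` line were omitted, every call would return the same chunk and the loop would never finish, which is why the index update is the caller's responsibility.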

prev_samples_beyond_offset

prev_samples_beyond_offset: int = 0

speech_tokens

property speech_tokens: ndarray[tuple[Any, ...], dtype[integer[Any]]]

The slice of generated speech tokens valid so far.

streaming

streaming: bool = False

update_speech_tokens()

update_speech_tokens(new_tokens)

Updates the buffer with new speech tokens.

Parameters:

new_tokens (ndarray[tuple[Any, ...], dtype[integer[Any]]]) – The new speech tokens to add to the buffer.

Return type:

None
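
One way to picture how `speech_tokens` and `update_speech_tokens()` relate is a preallocated buffer paired with an end index, where the property exposes only the filled portion. The sketch below is an assumption, not the MAX implementation: the constructor fields `_speech_token_size=128` and `_speech_token_end_idx=0` hint at a design like this but do not confirm it. `ToySpeechBuffer` is a hypothetical name, the block-wise growth strategy is invented for illustration, and plain lists replace the NumPy arrays.

```python
class ToySpeechBuffer:
    """Hypothetical stand-in: preallocated token storage plus an end index."""

    def __init__(self, block_size=128):
        self._block_size = block_size
        self._tokens = [0] * block_size  # preallocated storage
        self._end_idx = 0                # one past the last valid token

    @property
    def speech_tokens(self):
        # Only the filled prefix is valid; the rest is unused padding.
        return self._tokens[: self._end_idx]

    def update_speech_tokens(self, new_tokens):
        needed = self._end_idx + len(new_tokens)
        # Assumed growth strategy: extend storage one block at a time.
        while needed > len(self._tokens):
            self._tokens.extend([0] * self._block_size)
        self._tokens[self._end_idx:needed] = new_tokens
        self._end_idx = needed


buf = ToySpeechBuffer(block_size=4)
buf.update_speech_tokens([1, 2, 3])
buf.update_speech_tokens([4, 5, 6])
print(buf.speech_tokens)
```

Keeping the storage larger than the valid slice avoids reallocating on every update; readers only ever see the tokens written so far.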