Python module

interfaces

Universal interfaces between all aspects of the MAX Inference Stack.

AudioGenerationMetadata

class max.interfaces.AudioGenerationMetadata(*, sample_rate=None, duration=None, chunk_id=None, timestamp=None, final_chunk=None, model_name=None, request_id=None, tokens_generated=None, processing_time=None, echo=None)

Represents metadata associated with audio generation.

This class will eventually replace the metadata dictionary used throughout the AudioGenerationOutput object, providing a structured and type-safe alternative for audio generation metadata.

Parameters:

  • sample_rate (int | None) – The sample rate of the generated audio in Hz.
  • duration (float | None) – The duration of the generated audio in seconds.
  • chunk_id (int | None) – Identifier for the audio chunk (useful for streaming).
  • timestamp (str | None) – Timestamp when the audio was generated.
  • final_chunk (bool | None) – Whether this is the final chunk in a streaming sequence.
  • model_name (str | None) – Name of the model used for generation.
  • request_id (str | None) – Unique identifier for the generation request.
  • tokens_generated (int | None) – Number of tokens generated for this audio.
  • processing_time (float | None) – Time taken to process this audio chunk in seconds.
  • echo (str | None) – Echo of the input prompt or identifier for verification.

chunk_id

chunk_id: int | None

duration

duration: float | None

echo

echo: str | None

final_chunk

final_chunk: bool | None

model_name

model_name: str | None

processing_time

processing_time: float | None

request_id

request_id: str | None

sample_rate

sample_rate: int | None

timestamp

timestamp: str | None

to_dict()

to_dict()

Convert the metadata to a dictionary format.

Returns:

Dictionary representation of the metadata.

Return type:

dict[str, Any]

tokens_generated

tokens_generated: int | None
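
For illustration, here is a minimal sketch of constructing metadata and serializing it with to_dict(), assuming keyword-only construction as shown in the signature above; the field values are placeholders.

from max.interfaces import AudioGenerationMetadata

metadata = AudioGenerationMetadata(
    sample_rate=24000,       # Hz
    duration=1.5,            # seconds
    chunk_id=0,
    final_chunk=True,
    model_name="example-tts-model",  # placeholder model name
)
print(metadata.to_dict())  # plain-dict view of the populated fields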

AudioGenerationRequest

class max.interfaces.AudioGenerationRequest(request_id: str, index: 'int', model: 'str', lora: 'str | None' = None, input: 'Optional[str]' = None, audio_prompt_tokens: 'list[int]' = <factory>, audio_prompt_transcription: 'str' = '', sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0), _assistant_message_override: 'str | None' = None, prompt: 'Optional[list[int] | str]' = None, streaming: 'bool' = True, buffer_speech_tokens: 'np.ndarray | None' = None)

Parameters:

audio_prompt_tokens

audio_prompt_tokens: list[int]

The prompt speech IDs to use for audio generation.

audio_prompt_transcription

audio_prompt_transcription: str = ''

The audio prompt transcription to use for audio generation.

buffer_speech_tokens

buffer_speech_tokens: np.ndarray | None = None

An optional field containing the last N speech tokens generated by the model in a previous request.

When this field is specified, this tensor is used to buffer the tokens sent to the audio decoder.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

input

input: str | None = None

The text to generate audio for. The maximum length is 4096 characters.

lora

lora: str | None = None

The name of the LoRA to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

model

model: str

The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: list[int] | str | None = None

Optionally provide a preprocessed list of token IDs or a prompt string to pass directly as input to the model. When set, this bypasses the automatic generation of TokenGeneratorRequestMessages from the input, audio prompt tokens, and audio prompt transcription fields.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request sampling configuration options.

streaming

streaming: bool = True

Whether to stream the audio generation.
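
A minimal construction sketch based on the dataclass signature above; the request ID, model name, and input text are placeholders.

from max.interfaces import AudioGenerationRequest, SamplingParams

request = AudioGenerationRequest(
    request_id="req-0",            # placeholder ID
    index=0,
    model="example-tts-model",     # placeholder model name
    input="Hello from MAX.",
    sampling_params=SamplingParams(top_k=1, max_new_tokens=512),
    streaming=True,
)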

AudioGenerationResponse

class max.interfaces.AudioGenerationResponse(final_status, audio=None, buffer_speech_tokens=None)

Represents a response from the audio generation API.

This class encapsulates the result of an audio generation request, including the final status, generated audio data, and optional buffered speech tokens.

Parameters:

audio

audio: ndarray | None

The generated audio data, if available.

audio_data

property audio_data: ndarray

Returns the audio data if available.

Returns:

The generated audio data.

Return type:

ndarray

Raises:

AssertionError – If audio data is not available.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

Buffered speech tokens, if available.

final_status

final_status: GenerationStatus

The final status of the generation process.

has_audio_data

property has_audio_data: bool

Checks if audio data is present in the response.

Returns:

True if audio data is available, False otherwise.

Return type:

bool

is_done

property is_done: bool

Indicates whether the audio generation process is complete.

Returns:

True if generation is done, False otherwise.

Return type:

bool
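
A sketch of inspecting a response, assuming construction with the arguments shown in the signature above; the audio buffer here is dummy data.

import numpy as np

from max.interfaces import AudioGenerationResponse, GenerationStatus

response = AudioGenerationResponse(
    final_status=GenerationStatus.END_OF_SEQUENCE,
    audio=np.zeros(24000, dtype=np.float32),  # dummy one-second buffer at 24 kHz
)
if response.is_done and response.has_audio_data:
    waveform = response.audio_data  # safe: has_audio_data guards the access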

AudioGenerator

class max.interfaces.AudioGenerator(*args, **kwargs)

Interface for audio generation models.

decoder_sample_rate

property decoder_sample_rate: int

The sample rate of the decoder.

next_chunk()

next_chunk(batch)

Computes the next audio chunk for a single batch.

The new speech tokens are saved to the context. The most recently generated audio is returned through the AudioGenerationResponse.

Parameters:

batch (dict[str, AudioGeneratorContext]) – Batch of contexts.

Returns:

Dictionary mapping request IDs to audio generation responses.

Return type:

dict[str, AudioGenerationResponse]

prev_num_steps

property prev_num_steps: int

The number of speech tokens that were generated during the processing of the previous batch.

release()

release(context)

Releases resources associated with this context.

Parameters:

context (AudioGeneratorContext) – Finished context.

Return type:

None
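
A hypothetical driver loop over an AudioGenerator implementation, sketched from the next_chunk() and release() contracts above; the generator and batch are assumed to be supplied by the caller.

def drive_audio_generation(generator, batch):
    # batch: dict mapping request IDs to AudioGeneratorContext instances.
    while batch:
        responses = generator.next_chunk(batch)
        for request_id, response in responses.items():
            if response.is_done:
                # Release resources for finished contexts and drop them
                # from the batch so they are not scheduled again.
                generator.release(batch.pop(request_id))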

AudioGeneratorOutput

class max.interfaces.AudioGeneratorOutput(audio_data, metadata, is_done, buffer_speech_tokens=None)

Represents the output of an audio generation step.

Parameters:

audio_data

audio_data: ndarray

The generated audio data as a NumPy array.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

An optional field containing the last N speech tokens generated by the model. This can be used to buffer speech tokens for a follow-up request, enabling seamless continuation of audio generation.

is_done

is_done: bool

Indicates whether the audio generation is complete (True) or if more chunks are expected (False).

metadata

metadata: AudioGenerationMetadata

Metadata associated with the audio generation, such as chunk information, prompt details, or other relevant context.
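
A small construction sketch combining AudioGeneratorOutput with AudioGenerationMetadata, using placeholder audio data.

import numpy as np

from max.interfaces import AudioGenerationMetadata, AudioGeneratorOutput

chunk = AudioGeneratorOutput(
    audio_data=np.zeros(2400, dtype=np.float32),  # placeholder audio chunk
    metadata=AudioGenerationMetadata(sample_rate=24000, chunk_id=0),
    is_done=False,  # more chunks expected
)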

BaseContext

class max.interfaces.BaseContext(*args, **kwargs)

Core interface for request lifecycle management across all of MAX, including serving, scheduling, and pipelines.

This protocol is intended to provide a unified, minimal contract for request state and status handling throughout the MAX stack. Over time, BaseContext is expected to supersede and replace InputContext as the canonical context interface, as we refactor and standardize context handling across the codebase.

is_done

property is_done: bool

Whether the request has completed generation.

request_id

property request_id: str

Unique identifier for the request.

status

property status: GenerationStatus

Current generation status of the request.

update_status()

update_status(status)

Update the generation status of the request.

Parameters:

status (GenerationStatus)

Return type:

None
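
An illustrative (non-authoritative) class that satisfies the BaseContext protocol, showing the minimal surface a context must expose.

from max.interfaces import GenerationStatus

class SimpleContext:
    """Toy context that satisfies the BaseContext protocol."""

    def __init__(self, request_id: str) -> None:
        self._request_id = request_id
        self._status = GenerationStatus.ACTIVE

    @property
    def request_id(self) -> str:
        return self._request_id

    @property
    def status(self) -> GenerationStatus:
        return self._status

    @property
    def is_done(self) -> bool:
        return self._status.is_done

    def update_status(self, status: GenerationStatus) -> None:
        self._status = status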

EmbeddingsGenerator

class max.interfaces.EmbeddingsGenerator(*args, **kwargs)

Interface for LLM embeddings-generator models.

encode()

encode(batch)

Computes embeddings for a batch of inputs.

Parameters:

batch (dict[str, EmbeddingsGeneratorContext]) – Batch of contexts to generate embeddings for.

Returns:

Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values.

Return type:

dict[str, Any]

EmbeddingsOutput

class max.interfaces.EmbeddingsOutput(embeddings)

Response structure for embedding generation.

Parameters:

embeddings (ndarray) – The generated embeddings as a NumPy array.

embeddings

embeddings: ndarray

The generated embeddings as a NumPy array.

is_done

property is_done: bool

Indicates whether the embedding generation process is complete.

Returns:

Always True, as embedding generation is a single-step operation.

Return type:

bool

GenerationStatus

class max.interfaces.GenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the status of a generation process in the MAX API.

ACTIVE

ACTIVE = 'active'

The generation process is ongoing.

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

The generation process has reached the end of the sequence.

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

The generation process has reached the maximum allowed length.

is_done

property is_done: bool

Returns True if the generation process is complete (not ACTIVE).

Returns:

True if the status is not ACTIVE, indicating completion.

Return type:

bool
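
A short sketch of how the enum and its is_done helper behave:

from max.interfaces import GenerationStatus

status = GenerationStatus.ACTIVE
assert not status.is_done                        # generation still in progress
assert GenerationStatus.END_OF_SEQUENCE.is_done  # any non-ACTIVE status is done
assert GenerationStatus.MAXIMUM_LENGTH.is_done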

InputContext

class max.interfaces.InputContext(*args, **kwargs)

Protocol defining the interface for model input contexts in token generation.

An InputContext represents model inputs for TokenGenerator instances, managing the state of tokens throughout the generation process. It handles token arrays, generation status, sampling parameters, and various indices that track different stages of token processing.

The context maintains a token array with the following layout:

                     +---------- full prompt ----------+   CHUNK_SIZE*N v
+--------------------+---------------+-----------------+----------------+
|     completed      |  next_tokens  |                 |  preallocated  |
+--------------------+---------------+-----------------+----------------+
           start_idx ^    active_idx ^         end_idx ^

Token Array Regions:
  • completed: Tokens that have already been processed and encoded.
  • next_tokens: Tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
  • preallocated: Token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate new tokens.

Key Indices:
  • start_idx: Marks the beginning of completed tokens.
  • active_idx: Marks the start of next_tokens within the array.
  • end_idx: Marks the end of all active tokens (one past the last token).
  • committed_idx: Marks tokens that have been committed and returned to the user.

    active_idx

    property active_idx: int

    The index marking the start of next_tokens within the token array.

    This index separates completed tokens from tokens that will be processed in the next iteration during chunked prefill or generation.

    Returns:

    The zero-based index where next_tokens begin in the token array.

    active_length

    property active_length: int

    The number of tokens being processed in the current iteration.

    During context encoding (prompt processing), this equals the prompt size or chunk size for chunked prefill. During token generation, this is typically 1 (one new token per iteration).

    Returns:

    The number of tokens being processed in this iteration.

    all_tokens

    property all_tokens: ndarray

    All active tokens in the context (prompt and generated).

    This property returns only the meaningful tokens, excluding any preallocated but unused slots in the token array.

    Returns:

    A 1D NumPy array containing all prompt and generated tokens.

    bump_token_indices()

    bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

    Increment token indices by the specified amounts.

    This method provides fine-grained control over token index management, allowing incremental updates to track token processing progress.

    Parameters:

    • start_idx (int) – Amount to increment the start_idx by.
    • active_idx (int) – Amount to increment the active_idx by.
    • end_idx (int) – Amount to increment the end_idx by.
    • committed_idx (int) – Amount to increment the committed_idx by.

    Return type:

    None

    committed_idx

    property committed_idx: int

    The index marking tokens that have been committed and returned to the user.

    Committed tokens are those that have been finalized in the generation process and delivered as output to the user.

    Returns:

    The zero-based index up to which tokens have been committed.

    compute_num_available_steps()

    compute_num_available_steps(max_seq_len)

    Compute the maximum number of generation steps available.

    This method calculates how many tokens can be generated without exceeding the specified maximum sequence length limit.

    Parameters:

    max_seq_len (int) – The maximum allowed sequence length for this context.

    Returns:

    The number of generation steps that can be executed before reaching the sequence length limit.

    Return type:

    int

    current_length

    property current_length: int

    The current total length of the sequence.

    This includes both completed tokens and tokens currently being processed, representing the total number of tokens in the active sequence.

    Returns:

    The total number of tokens including completed and active tokens.

    end_idx

    property end_idx: int

    The index marking the end of all active tokens in the token array.

    This is an exclusive end index (one past the last active token), following Python’s standard slicing conventions.

    Returns:

    The zero-based index one position past the last active token.

    eos_token_ids

    property eos_token_ids: set[int]

    The set of end-of-sequence token IDs that can terminate generation.

    Returns:

    A set of token IDs that, when generated, will signal the end of the sequence and terminate the generation process.

    generated_tokens

    property generated_tokens: ndarray

    All tokens generated by the model for this context.

    This excludes the original prompt tokens and includes only tokens that have been produced during the generation process.

    Returns:

    A 1D NumPy array containing generated token IDs.

    get_min_token_logit_mask()

    get_min_token_logit_mask(num_steps)

    Get token indices that should be masked in the output logits.

    This method is primarily used to implement the min_tokens constraint, where certain tokens (typically EOS tokens) are masked to prevent early termination before the minimum token count is reached.

    Parameters:

    num_steps (int) – The number of generation steps to compute masks for.

    Returns:

    A list of NumPy arrays, where each array contains token indices that should be masked (set to negative infinity) in the logits for the corresponding generation step.

    Return type:

    list[ndarray[Any, dtype[int32]]]

    is_ce

    property is_ce: bool

    Whether this context is in context encoding (CE) mode.

    Context encoding mode indicates that the context is processing input tokens (prompt) rather than generating new tokens.

    Returns:

    True if this is a context encoding context, False if it’s in token generation mode.

    is_done

    property is_done: bool

    Whether the generation process for this context has completed.

    Returns:

    True if generation has finished successfully or been terminated, False if generation is still in progress.

    is_initial_prompt

    property is_initial_prompt: bool

    Whether this context contains only the initial prompt.

    This property indicates if the context has not yet been updated with any generated tokens and still contains only the original input.

    Returns:

    True if no tokens have been generated yet, False if generation has begun and tokens have been added.

    json_schema

    property json_schema: str | None

    The JSON schema for constrained decoding, if configured.

    When set, this schema constrains token generation to produce valid JSON output that conforms to the specified structure.

    Returns:

    The JSON schema string, or None if no schema constraint is active.

    jump_ahead()

    jump_ahead(new_token)

    Jump ahead in generation by adding a token and updating indices.

    This method is used in speculative decoding scenarios to quickly advance the generation state when draft tokens are accepted.

    Parameters:

    new_token (int) – The token ID to add when jumping ahead in the sequence.

    Return type:

    None

    log_probabilities

    property log_probabilities: int

    The number of top tokens to return log probabilities for.

    When greater than 0, the system returns log probabilities for the top N most likely tokens at each generation step.

    Returns:

    The number of top tokens to include in log probability output. Returns 0 if log probabilities are disabled.

    log_probabilities_echo

    property log_probabilities_echo: bool

    Whether to include input tokens in the returned log probabilities.

    When True, log probabilities will be computed and returned for input (prompt) tokens in addition to generated tokens.

    Returns:

    True if input tokens should be included in log probability output, False otherwise.

    matcher

    property matcher: Any | None

    The grammar matcher for structured output generation, if configured.

    The matcher enforces structural constraints (like JSON schema) during generation to ensure valid formatted output.

    Returns:

    The grammar matcher instance, or None if no structured generation is configured for this context.

    max_length

    property max_length: int | None

    The maximum allowed length for this sequence.

    When set, generation will stop when this length is reached, regardless of other stopping criteria.

    Returns:

    The maximum sequence length limit, or None if no limit is set.

    min_tokens

    property min_tokens: int

    The minimum number of new tokens that must be generated.

    Generation will continue until at least this many new tokens have been produced, even if other stopping criteria are met (e.g., EOS tokens).

    Returns:

    The minimum number of new tokens to generate.

    next_tokens

    property next_tokens: ndarray

    The tokens to be processed in the next model iteration.

    This array contains the tokens that will be fed to the model in the upcoming forward pass. The length should match active_length.

    Returns:

    A 1D NumPy array of token IDs with length equal to active_length.

    outstanding_completion_tokens()

    outstanding_completion_tokens()

    Get completion tokens that are ready to be returned to the user.

    This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.

    Returns:

    A list of tuples, where each tuple contains a token ID and its associated log probabilities (or None if log probabilities are not enabled).

    Return type:

    list[tuple[int, LogProbabilities | None]]

    prompt_tokens

    property prompt_tokens: ndarray

    The original prompt tokens for this context.

    These are the input tokens that were provided to start the generation process, before any tokens were generated by the model.

    Returns:

    A 1D NumPy array containing the original prompt token IDs.

    request_id

    property request_id: str

    The unique identifier for this generation request.

    Returns:

    A RequestID that uniquely identifies this request across the system.

    reset()

    reset()

    Reset the context state by consolidating all tokens into a new prompt.

    This method is typically used when a request is evicted from cache, requiring the context to be re-encoded in a subsequent context encoding iteration. All generated tokens become part of the new prompt.

    Return type:

    None

    rollback()

    rollback(idx)

    Rollback generation by removing the specified number of tokens.

    This method is used to undo recent generation steps, typically when implementing techniques like beam search or when handling generation errors that require backtracking.

    Parameters:

    idx (int) – The number of tokens to remove from the end of the sequence.

    Return type:

    None

    sampling_params

    property sampling_params: SamplingParams

    The sampling parameters configured for this generation request.

    These parameters control how tokens are selected during generation, including temperature, top-k/top-p filtering, and stopping criteria.

    Returns:

    The SamplingParams instance containing all sampling configuration for this context.

    set_draft_offset()

    set_draft_offset(idx)

    Set the draft token offset for speculative decoding optimization.

    This method configures the offset used in speculative decoding, where draft tokens are generated speculatively to improve generation throughput.

    Parameters:

    idx (int) – The offset index for draft tokens in the speculative decoding process.

    Return type:

    None

    set_matcher()

    set_matcher(matcher)

    Set a grammar matcher for constrained decoding.

    This method configures structured output generation by installing a grammar matcher that enforces format constraints during token generation.

    Parameters:

    matcher (Any) – The grammar matcher instance to use for constraining output. The specific type depends on the structured generation backend.

    Return type:

    None

    set_token_indices()

    set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

    Set token indices to specific absolute values.

    This method provides direct control over token index positioning, allowing precise management of the token array state.

    Parameters:

    • start_idx (int | None) – New absolute value for start_idx, if provided.
    • active_idx (int | None) – New absolute value for active_idx, if provided.
    • end_idx (int | None) – New absolute value for end_idx, if provided.
    • committed_idx (int | None) – New absolute value for committed_idx, if provided.

    Return type:

    None

    start_idx

    property start_idx: int

    The index marking the start of completed tokens in the token array.

    Completed tokens are those that have already been processed and encoded by the model in previous iterations.

    Returns:

    The zero-based index where completed tokens begin in the token array.

    status

    property status: GenerationStatus

    The current generation status of this context.

    Returns:

    The GenerationStatus indicating the current state of generation (e.g., encoding, generating, completed, or error).

    tokens

    property tokens: ndarray

    The complete token array including preallocated slots.

    This includes all tokens (completed, active, and preallocated empty slots). For most use cases, prefer all_tokens to get only the active tokens.

    Returns:

    A 1D NumPy array containing all tokens including padding.

    update()

    update(new_token, log_probabilities=None)

    Update the context with a newly generated token.

    This method adds a generated token to the context, updating the token array and associated metadata. It also stores log probability information if provided.

    Parameters:

    • new_token (int) – The token ID to add to the generation sequence.
    • log_probabilities (LogProbabilities | None) – Optional log probability data for the new token and alternatives. Used for analysis and debugging.

    Return type:

    None

    update_status()

    update_status(status)

    Update the current generation status of this context.

    This method transitions the context to a new generation state, such as moving from encoding to generating or marking completion.

    Parameters:

    status (GenerationStatus) – The new GenerationStatus to assign to this context.

    Return type:

    None

    LogProbabilities

    class max.interfaces.LogProbabilities(token_log_probabilities, top_log_probabilities)

    Log probabilities for an individual output token.

    This is a data-only class that serves as a serializable data structure for transferring log probability information. It does not provide any functionality for calculating or manipulating log probabilities - it is purely for data storage and serialization purposes.

    Parameters:

    token_log_probabilities

    token_log_probabilities: list[float]

    Log probabilities of each token.

    top_log_probabilities

    top_log_probabilities: list[dict[int, float]]

    Top tokens and their corresponding log probabilities.
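
    A small construction sketch with made-up values, matching the two documented fields:

    from max.interfaces import LogProbabilities

    log_probs = LogProbabilities(
        token_log_probabilities=[-0.12, -1.30],            # one entry per generated token
        top_log_probabilities=[{42: -0.12}, {99: -1.30}],  # top alternatives per step
    )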

    MAXQueue

    class max.interfaces.MAXQueue(*args, **kwargs)

    Protocol for a minimal, non-blocking queue interface in MAX.

    This protocol defines the minimal contract for a queue that supports non-blocking put and get operations. It is generic over the item type.

    get_nowait()

    get_nowait()

    Remove and return an item from the queue without blocking.

    This method is expected to raise queue.Empty if no item is available to retrieve from the queue.

    Returns:

    The item removed from the queue.

    Return type:

    ItemType

    Raises:

    queue.Empty – If the queue is empty and no item can be retrieved.

    put_nowait()

    put_nowait(item)

    Attempt to put an item into the queue without blocking.

    This method is designed to immediately fail (typically by raising an exception) if the item cannot be added to the queue at the time of the call. Unlike the traditional ‘put’ method in many queue implementations—which may block until space becomes available or the transfer is completed—this method never waits. It is intended for use cases where the caller must be notified of failure to enqueue immediately, rather than waiting for space.

    Parameters:

    item (ItemType) – The item to be added to the queue.

    Return type:

    None
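
    Because the protocol only requires put_nowait() and get_nowait(), the standard-library queue.Queue already satisfies it structurally; a minimal sketch:

    import queue

    q: "queue.Queue[int]" = queue.Queue(maxsize=2)
    q.put_nowait(1)  # raises queue.Full if the item cannot be enqueued
    try:
        item = q.get_nowait()
    except queue.Empty:
        item = None  # nothing available right now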

    Pipeline

    class max.interfaces.Pipeline

    Abstract base class for pipeline operations.

    This generic abstract class defines the interface for pipeline operations that transform inputs of type PipelineInputsType into outputs of type PipelineOutputsDict[PipelineOutputType]. All concrete pipeline implementations must inherit from this class and implement the execute method.

    Type Parameters:
      • PipelineInputsType: The type of inputs this pipeline accepts; must inherit from PipelineInputs.
      • PipelineOutputType: The type of outputs this pipeline produces; must be a subclass of PipelineOutput.

    class MyPipeline(Pipeline[MyInputs, MyOutput]):
        def execute(self, inputs: MyInputs) -> dict[RequestID, MyOutput]:
            # Implementation here
            pass

    execute()

    abstract execute(inputs)

    Execute the pipeline operation with the given inputs.

    This method must be implemented by all concrete pipeline classes. It takes inputs of the specified type and returns outputs according to the pipeline’s processing logic.

    Parameters:

    inputs (PipelineInputsType) – The input data for the pipeline operation, must be of type PipelineInputsType

    Returns:

    The results of the pipeline operation, as a dictionary mapping RequestID to PipelineOutputType

    Raises:

    NotImplementedError – If not implemented by a concrete subclass

    Return type:

    dict[str, PipelineOutputType]

    release()

    abstract release(request_id)

    Release any resources or state associated with a specific request.

    This method should be implemented by concrete pipeline classes to perform cleanup or resource deallocation for the given request ID. It is typically called when a request has completed processing and its associated resources (such as memory, cache, or temporary files) are no longer needed.

    Parameters:

    request_id (RequestID) – The unique identifier of the request to release resources for.

    Returns:

    None

    Raises:

    NotImplementedError – If not implemented by a concrete subclass.

    Return type:

    None

    PipelineTask

    class max.interfaces.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

    Enum representing the types of pipeline tasks supported.

    AUDIO_GENERATION

    AUDIO_GENERATION = 'audio_generation'

    Task for generating audio.

    EMBEDDINGS_GENERATION

    EMBEDDINGS_GENERATION = 'embeddings_generation'

    Task for generating embeddings.

    SPEECH_TOKEN_GENERATION

    SPEECH_TOKEN_GENERATION = 'speech_token_generation'

    Task for generating speech tokens.

    TEXT_GENERATION

    TEXT_GENERATION = 'text_generation'

    Task for generating text.

    output_type

    property output_type: type

    Get the output type for the pipeline task.

    Returns:

    The output type for the pipeline task.

    Return type:

    type

    PipelineTokenizer

    class max.interfaces.PipelineTokenizer(*args, **kwargs)

    Interface for LLM tokenizers.

    decode()

    async decode(encoded, **kwargs)

    Decodes response tokens to text.

    Parameters:

    encoded (TokenizerEncoded) – Encoded response tokens.

    Returns:

    Un-encoded response text.

    Return type:

    str

    encode()

    async encode(prompt, add_special_tokens)

    Encodes text prompts as tokens.

    Parameters:

    • prompt (str) – Un-encoded prompt text.
    • add_special_tokens (bool)

    Raises:

    ValueError – If the prompt exceeds the configured maximum length.

    Return type:

    TokenizerEncoded

    eos

    property eos: int

    The end of sequence token for this tokenizer.

    expects_content_wrapping

    property expects_content_wrapping: bool

    If true, this tokenizer expects messages to have a content property.

    Text messages are formatted as:

    { "type": "text", "content": "text content" }
    { "type": "text", "content": "text content" }

    instead of the OpenAI spec:

    { "type": "text", "text": "text content" }
    { "type": "text", "text": "text content" }

    NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:

    { "type": "image" }
    { "type": "image" }

    Their content is provided as byte arrays through the top-level property on the request object, i.e., RequestType.images.

    new_context()

    async new_context(request)

    Creates a new context from a request object. This is sent to the worker process once and then cached locally.

    Parameters:

    request (RequestType) – Incoming request.

    Returns:

    Initialized context.

    Return type:

    UnboundContextType
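
    A hedged usage sketch of the async encode/decode pair, assuming a concrete tokenizer instance is available:

    import asyncio

    async def round_trip(tokenizer, text: str) -> str:
        # Encode the prompt, then decode the tokens back to text.
        encoded = await tokenizer.encode(text, add_special_tokens=True)
        return await tokenizer.decode(encoded)

    # asyncio.run(round_trip(my_tokenizer, "Hello"))  # my_tokenizer is hypothetical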

    Request

    class max.interfaces.Request(request_id)

    Base class representing a generic request within the MAX API.

    This class provides a unique identifier for each request, ensuring that all requests can be tracked and referenced consistently throughout the system. Subclasses can extend this class to include additional fields specific to their request types.

    Parameters:

    request_id (str)

    request_id

    request_id: str

    RequestID

    max.interfaces.RequestID

    alias of str

    SamplingParams

    class max.interfaces.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

    Request specific sampling parameters that are only known at run time.

    Parameters:

    detokenize

    detokenize: bool = True

    Whether to detokenize the output tokens into text.

    frequency_penalty

    frequency_penalty: float = 0.0

    The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

    ignore_eos

    ignore_eos: bool = False

    If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.

    max_new_tokens

    max_new_tokens: int | None = None

    The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

    min_new_tokens

    min_new_tokens: int = 0

    The minimum number of tokens to generate in the response.

    min_p

    min_p: float = 0.0

    Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

    presence_penalty

    presence_penalty: float = 0.0

    The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

    repetition_penalty

    repetition_penalty: float = 1.0

    The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

    seed

    seed: int = 0

    The seed to use for the random number generator.

    stop

    stop: list[str] | None = None

    A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

    stop_token_ids

    stop_token_ids: list[int] | None = None

    A list of token ids that are used as stopping criteria when generating a new sequence.

    temperature

    temperature: float = 1

    Controls the randomness of the model’s output; higher values produce more diverse responses.

    top_k

    top_k: int = 1

    Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

    top_p

    top_p: float = 1

    Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
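
    A construction sketch using only the documented fields; the values are arbitrary:

    from max.interfaces import SamplingParams

    params = SamplingParams(
        top_k=40,
        top_p=0.95,
        temperature=0.7,
        max_new_tokens=256,
        stop=["\n\n"],  # stop when a blank line is generated
        seed=42,
    )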

    SchedulerResult

    class max.interfaces.SchedulerResult(status, result)

    Structure representing the result of a scheduler operation for a specific pipeline execution.

    This class encapsulates the outcome of a pipeline operation as managed by the scheduler, including both the execution status and any resulting data from the pipeline. The scheduler uses this structure to communicate the state of pipeline operations back to clients, whether the operation is still running, has completed successfully, or was cancelled.

    The generic type parameter allows this result to work with different types of pipeline outputs while maintaining type safety.

    Parameters:

    active()

    classmethod active(result)

    Create a SchedulerResult representing an active pipeline operation.

    Parameters:

    result (PipelineOutputType) – The current pipeline output data (may be partial for streaming operations).

    Returns:

    A SchedulerResult with ACTIVE status and the provided result.

    Return type:

    SchedulerResult

    cancelled()

    classmethod cancelled()

    Create a SchedulerResult representing a cancelled pipeline operation.

    Returns:

    A SchedulerResult with CANCELLED status and no result.

    Return type:

    SchedulerResult

    complete()

    classmethod complete(result)

    Create a SchedulerResult representing a completed pipeline operation.

    Parameters:

    result (PipelineOutputType) – The final pipeline output data.

    Returns:

    A SchedulerResult with COMPLETE status and the final result.

    Return type:

    SchedulerResult

    result

    result: PipelineOutputType | None

    The pipeline output data, if any. May be None for cancelled operations or during intermediate states of streaming operations.

    status

    status: SchedulerStatus

    The current status of the pipeline operation from the scheduler’s perspective.

    stop_stream

    property stop_stream: bool

    Determine if the pipeline operation stream should continue based on the current status.

    Returns:

    True if the pipeline operation stream should stop (CANCELLED or COMPLETE), False if it should continue (ACTIVE).

    Return type:

    bool
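
    A sketch of the classmethod constructors and the stop_stream helper, using plain strings as a stand-in for the pipeline output type:

    from max.interfaces import SchedulerResult

    partial = SchedulerResult.active("Hello")        # streaming, more data to come
    final = SchedulerResult.complete("Hello world")  # terminal result

    assert not partial.stop_stream
    assert final.stop_stream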

    SchedulerStatus

    class max.interfaces.SchedulerStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

    Represents the status of a scheduler operation for a specific pipeline execution.

    The scheduler manages the execution of pipeline operations and returns status updates to indicate the current state of the pipeline execution. This enum defines the possible states that a pipeline operation can be in from the scheduler’s perspective.

    ACTIVE

    ACTIVE = 'active'

    Indicates that the scheduler executed the pipeline operation successfully and the request remains active.

    CANCELLED

    CANCELLED = 'cancelled'

    Indicates that the pipeline operation was cancelled before completion; no further data will be provided.

    COMPLETE

    COMPLETE = 'complete'

    Indicates that the pipeline operation has already finished and no further data will be streamed.

    SharedMemoryArray

    class max.interfaces.SharedMemoryArray(name, shape, dtype)

    Wrapper for numpy array stored in shared memory.

    This class is used as a placeholder in pixel_values during serialization. It will be encoded as a dict with __shm__ flag and decoded back to a numpy array.

    Parameters:

    TextGenerationInputs

    class max.interfaces.TextGenerationInputs(batch, num_steps)

    Input parameters for text generation pipeline operations.

    This class encapsulates the batch of contexts and number of steps required for token generation in a single input object, replacing the previous pattern of passing batch and num_steps as separate parameters.

    Parameters:

    • batch (dict[str, TextGenerationContextType])
    • num_steps (int)

    batch

    batch: dict[str, TextGenerationContextType]

    Dictionary mapping request IDs to context objects.

    num_steps

    num_steps: int

    Number of tokens to generate.

    TextGenerationOutput

    class max.interfaces.TextGenerationOutput(request_id, tokens, final_status, log_probabilities=None)

    Represents the output of a text generation operation, combining token IDs, final generation status, request ID, and optional log probabilities for each token.

    Parameters:

    final_status

    final_status: GenerationStatus

    The final status of the generation process.

    is_done

    property is_done: bool

    Indicates whether the text generation process is complete.

    Returns:

    True if the generation is done, False otherwise.

    Return type:

    bool

    log_probabilities

    log_probabilities: list[LogProbabilities] | None

    Optional list of log probabilities for each token.

    request_id

    request_id: str

    The unique identifier for the generation request.

    tokens

    tokens: list[int]

    List of generated token IDs.
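
    A minimal construction sketch following the signature above; the token IDs are placeholders:

    from max.interfaces import GenerationStatus, TextGenerationOutput

    output = TextGenerationOutput(
        request_id="req-42",           # placeholder ID
        tokens=[101, 2023, 102],       # placeholder token IDs
        final_status=GenerationStatus.END_OF_SEQUENCE,
    )
    assert output.is_done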

    TextGenerationRequest

    class max.interfaces.TextGenerationRequest(request_id: str, index: 'int', model_name: 'str', lora_name: 'str | None' = None, prompt: 'Union[str, Sequence[int], None]' = None, messages: 'Optional[list[TextGenerationRequestMessage]]' = None, images: 'Optional[list[bytes]]' = None, tools: 'Optional[list[TextGenerationRequestTool]]' = None, response_format: 'Optional[TextGenerationResponseFormat]' = None, timestamp_ns: 'int' = 0, request_path: 'str' = '/', logprobs: 'int' = 0, echo: 'bool' = False, stop: 'Optional[Union[str, list[str]]]' = None, chat_template_options: 'Optional[dict[str, Any]]' = None, sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0))

    Parameters:

    chat_template_options

    chat_template_options: dict[str, Any] | None = None

    Optional dictionary of options to pass when applying the chat template.

    echo

    echo: bool = False

    If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.

    images

    images: list[bytes] | None = None

    A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.

    index

    index: int

    The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

    logprobs

    logprobs: int = 0

    The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.

    lora_name

    lora_name: str | None = None

    The name of the LoRA to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

    messages

    messages: list[TextGenerationRequestMessage] | None = None

    A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.

    model_name

    model_name: str

    The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

    prompt

    prompt: str | Sequence[int] | None = None

    The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.

    request_path

    request_path: str = '/'

    The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.

    response_format

    response_format: TextGenerationResponseFormat | None = None

    Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.

    sampling_params

    sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

    Token sampling configuration parameters for the request.

    stop

    stop: str | list[str] | None = None

    Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).

    timestamp_ns

    timestamp_ns: int = 0

    The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.

    tools

    tools: list[TextGenerationRequestTool] | None = None

    A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
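
    A construction sketch for a simple completion-style request; the model name and prompt are placeholders, and only documented fields are used:

    from max.interfaces import SamplingParams, TextGenerationRequest

    request = TextGenerationRequest(
        request_id="req-42",        # placeholder ID
        index=0,
        model_name="example-llm",   # placeholder model name
        prompt="Write a haiku about GPUs.",
        sampling_params=SamplingParams(temperature=0.7, max_new_tokens=64),
    )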

    TextGenerationRequestFunction

    class max.interfaces.TextGenerationRequestFunction

    Represents a function definition for a text generation request.

    description

    description: str

    A human-readable description of the function’s purpose.

    name

    name: str

    The name of the function to be invoked.

    parameters

    parameters: dict

    A dictionary describing the function’s parameters, typically following a JSON schema.

    TextGenerationRequestMessage

    class max.interfaces.TextGenerationRequestMessage

    content

    content: str | list[dict[str, Any]]

    Content can be a simple string or a list of message parts of different modalities.

    For example:

    {
      "role": "user",
      "content": "What's the weather like in Boston today?"
    }

    Or:

    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        }
      ]
    }

    role

    role: Literal['system', 'user', 'assistant']

    The role of the message sender, indicating whether the message is from the system, user, or assistant.

    TextGenerationRequestTool

    class max.interfaces.TextGenerationRequestTool

    Represents a tool definition for a text generation request.

    function

    function: TextGenerationRequestFunction

    The function definition associated with the tool, including its name, description, and parameters.

    type

    type: str

    The type of the tool, typically indicating the tool’s category or usage.

    TextGenerationResponseFormat

    class max.interfaces.TextGenerationResponseFormat

    Represents the response format specification for a text generation request.

    json_schema

    json_schema: dict

    A JSON schema dictionary that defines the structure and validation rules for the generated response.

    type

    type: str

    The type of response format, e.g., “json_object”.