Python module

interfaces

Universal interfaces between all aspects of the MAX Inference Stack.

AudioGenerationMetadata

class max.interfaces.AudioGenerationMetadata(*, sample_rate=None, duration=None, chunk_id=None, timestamp=None, final_chunk=None, model_name=None, request_id=None, tokens_generated=None, processing_time=None, echo=None)

Represents metadata associated with audio generation.

This class will eventually replace the metadata dictionary used throughout the AudioGenerationOutput object, providing a structured and type-safe alternative for audio generation metadata.

Parameters:

  • sample_rate (int | None) – The sample rate of the generated audio in Hz.
  • duration (float | None) – The duration of the generated audio in seconds.
  • chunk_id (int | None) – Identifier for the audio chunk (useful for streaming).
  • timestamp (str | None) – Timestamp when the audio was generated.
  • final_chunk (bool | None) – Whether this is the final chunk in a streaming sequence.
  • model_name (str | None) – Name of the model used for generation.
  • request_id (str | None) – Unique identifier for the generation request.
  • tokens_generated (int | None) – Number of tokens generated for this audio.
  • processing_time (float | None) – Time taken to process this audio chunk in seconds.
  • echo (str | None) – Echo of the input prompt or identifier for verification.

chunk_id

chunk_id: int | None

duration

duration: float | None

echo

echo: str | None

final_chunk

final_chunk: bool | None

model_name

model_name: str | None

processing_time

processing_time: float | None

request_id

request_id: str | None

sample_rate

sample_rate: int | None

timestamp

timestamp: str | None

to_dict()

to_dict()

Convert the metadata to a dictionary format.

Returns:

Dictionary representation of the metadata.

Return type:

dict[str, Any]

tokens_generated

tokens_generated: int | None
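
For illustration, here is a minimal sketch of constructing metadata and serializing it with to_dict(), assuming keyword-only construction as shown in the signature above; the field values are placeholders.

from max.interfaces import AudioGenerationMetadata

metadata = AudioGenerationMetadata(
    sample_rate=24000,       # Hz
    duration=1.5,            # seconds
    chunk_id=0,
    final_chunk=True,
    model_name="example-tts-model",  # placeholder model name
)
print(metadata.to_dict())  # plain-dict view of the populated fields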

AudioGenerationRequest

class max.interfaces.AudioGenerationRequest(request_id: str, index: 'int', model: 'str', lora: 'str | None' = None, input: 'Optional[str]' = None, audio_prompt_tokens: 'list[int]' = <factory>, audio_prompt_transcription: 'str' = '', sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0), _assistant_message_override: 'str | None' = None, prompt: 'Optional[list[int] | str]' = None, streaming: 'bool' = True, buffer_speech_tokens: 'np.ndarray | None' = None)

Parameters:

audio_prompt_tokens

audio_prompt_tokens: list[int]

The prompt speech IDs to use for audio generation.

audio_prompt_transcription

audio_prompt_transcription: str = ''

The audio prompt transcription to use for audio generation.

buffer_speech_tokens

buffer_speech_tokens: np.ndarray | None = None

An optional field containing the last N speech tokens generated by the model in a previous request.

When this field is specified, this tensor is used to buffer the tokens sent to the audio decoder.

index

index: int

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

input

input: str | None = None

The text to generate audio for. The maximum length is 4096 characters.

lora

lora: str | None = None

The name of the LoRA to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

model

model: str

The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt: list[int] | str | None = None

Optionally provide a preprocessed list of token IDs or a prompt string to pass directly as input to the model. When set, this bypasses the automatic generation of TokenGeneratorRequestMessages from the input, audio prompt tokens, and audio prompt transcription fields.

sampling_params

sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

Request sampling configuration options.

streaming

streaming: bool = True

Whether to stream the audio generation.
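
A minimal construction sketch based on the dataclass signature above; the request ID, model name, and input text are placeholders.

from max.interfaces import AudioGenerationRequest, SamplingParams

request = AudioGenerationRequest(
    request_id="req-0",            # placeholder ID
    index=0,
    model="example-tts-model",     # placeholder model name
    input="Hello from MAX.",
    sampling_params=SamplingParams(top_k=1, max_new_tokens=512),
    streaming=True,
)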

AudioGenerationResponse

class max.interfaces.AudioGenerationResponse(final_status, audio=None, buffer_speech_tokens=None)

Represents a response from the audio generation API.

This class encapsulates the result of an audio generation request, including the final status, generated audio data, and optional buffered speech tokens.

Parameters:

audio

audio: ndarray | None

The generated audio data, if available.

audio_data

property audio_data: ndarray

Returns the audio data if available.

Returns:

The generated audio data.

Return type:

ndarray

Raises:

AssertionError – If audio data is not available.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

Buffered speech tokens, if available.

final_status

final_status: GenerationStatus

The final status of the generation process.

has_audio_data

property has_audio_data: bool

Checks if audio data is present in the response.

Returns:

True if audio data is available, False otherwise.

Return type:

bool

is_done

property is_done: bool

Indicates whether the audio generation process is complete.

Returns:

True if generation is done, False otherwise.

Return type:

bool
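
A sketch of inspecting a response, assuming construction with the arguments shown in the signature above; the audio buffer here is dummy data.

import numpy as np

from max.interfaces import AudioGenerationResponse, GenerationStatus

response = AudioGenerationResponse(
    final_status=GenerationStatus.END_OF_SEQUENCE,
    audio=np.zeros(24000, dtype=np.float32),  # dummy one-second buffer at 24 kHz
)
if response.is_done and response.has_audio_data:
    waveform = response.audio_data  # safe: has_audio_data guards the access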

AudioGenerator

class max.interfaces.AudioGenerator(*args, **kwargs)

Interface for audio generation models.

decoder_sample_rate

property decoder_sample_rate: int

The sample rate of the decoder.

next_chunk()

next_chunk(batch)

Computes the next audio chunk for a single batch.

The new speech tokens are saved to the context. The most recently generated audio is returned through the AudioGenerationResponse.

Parameters:

batch (dict[str, AudioGeneratorContext]) – Batch of contexts.

Returns:

Dictionary mapping request IDs to audio generation responses.

Return type:

dict[str, AudioGenerationResponse]

prev_num_steps

property prev_num_steps: int

The number of speech tokens that were generated during the processing of the previous batch.

release()

release(context)

Releases resources associated with this context.

Parameters:

context (AudioGeneratorContext) – Finished context.

Return type:

None
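
A hypothetical driver loop over an AudioGenerator implementation, sketched from the next_chunk() and release() contracts above; the generator and batch are assumed to be supplied by the caller.

def drive_audio_generation(generator, batch):
    # batch: dict mapping request IDs to AudioGeneratorContext instances.
    while batch:
        responses = generator.next_chunk(batch)
        for request_id, response in responses.items():
            if response.is_done:
                # Release resources for finished contexts and drop them
                # from the batch so they are not scheduled again.
                generator.release(batch.pop(request_id))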

AudioGeneratorOutput

class max.interfaces.AudioGeneratorOutput(audio_data, metadata, is_done, buffer_speech_tokens=None)

Represents the output of an audio generation step.

Parameters:

audio_data

audio_data: ndarray

The generated audio data as a NumPy array.

buffer_speech_tokens

buffer_speech_tokens: ndarray | None

An optional field containing the last N speech tokens generated by the model. This can be used to buffer speech tokens for a follow-up request, enabling seamless continuation of audio generation.

is_done

is_done: bool

Indicates whether the audio generation is complete (True) or if more chunks are expected (False).

metadata

metadata: AudioGenerationMetadata

Metadata associated with the audio generation, such as chunk information, prompt details, or other relevant context.
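
A small construction sketch combining AudioGeneratorOutput with AudioGenerationMetadata, using placeholder audio data.

import numpy as np

from max.interfaces import AudioGenerationMetadata, AudioGeneratorOutput

chunk = AudioGeneratorOutput(
    audio_data=np.zeros(2400, dtype=np.float32),  # placeholder audio chunk
    metadata=AudioGenerationMetadata(sample_rate=24000, chunk_id=0),
    is_done=False,  # more chunks expected
)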

BaseContext

class max.interfaces.BaseContext(*args, **kwargs)

Core interface for request lifecycle management across all of MAX, including serving, scheduling, and pipelines.

This protocol is intended to provide a unified, minimal contract for request state and status handling throughout the MAX stack. Over time, BaseContext is expected to supersede and replace InputContext as the canonical context interface, as we refactor and standardize context handling across the codebase.

is_done

property is_done: bool

Whether the request has completed generation.

request_id

property request_id: str

Unique identifier for the request.

status

property status: GenerationStatus

Current generation status of the request.

update_status()

update_status(status)

Update the generation status of the request.

Parameters:

status (GenerationStatus)

Return type:

None
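
An illustrative (non-authoritative) class that satisfies the BaseContext protocol, showing the minimal surface a context must expose.

from max.interfaces import GenerationStatus

class SimpleContext:
    """Toy context that satisfies the BaseContext protocol."""

    def __init__(self, request_id: str) -> None:
        self._request_id = request_id
        self._status = GenerationStatus.ACTIVE

    @property
    def request_id(self) -> str:
        return self._request_id

    @property
    def status(self) -> GenerationStatus:
        return self._status

    @property
    def is_done(self) -> bool:
        return self._status.is_done

    def update_status(self, status: GenerationStatus) -> None:
        self._status = status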

EmbeddingsGenerator

class max.interfaces.EmbeddingsGenerator(*args, **kwargs)

Interface for LLM embeddings-generator models.

encode()

encode(batch)

Computes embeddings for a batch of inputs.

Parameters:

batch (dict[str, EmbeddingsGeneratorContext]) – Batch of contexts to generate embeddings for.

Returns:

Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values.

Return type:

dict[str, Any]

EmbeddingsOutput

class max.interfaces.EmbeddingsOutput(embeddings)

Response structure for embedding generation.

Parameters:

embeddings (ndarray) – The generated embeddings as a NumPy array.

embeddings

embeddings: ndarray

The generated embeddings as a NumPy array.

is_done

property is_done: bool

Indicates whether the embedding generation process is complete.

Returns:

Always True, as embedding generation is a single-step operation.

Return type:

bool

GenerationStatus

class max.interfaces.GenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Enum representing the status of a generation process in the MAX API.

ACTIVE

ACTIVE = 'active'

The generation process is ongoing.

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

The generation process has reached the end of the sequence.

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

The generation process has reached the maximum allowed length.

is_done

property is_done: bool

Returns True if the generation process is complete (not ACTIVE).

Returns:

True if the status is not ACTIVE, indicating completion.

Return type:

bool
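
A short sketch of how the enum and its is_done helper behave:

from max.interfaces import GenerationStatus

status = GenerationStatus.ACTIVE
assert not status.is_done                        # generation still in progress
assert GenerationStatus.END_OF_SEQUENCE.is_done  # any non-ACTIVE status is done
assert GenerationStatus.MAXIMUM_LENGTH.is_done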

InputContext

class max.interfaces.InputContext(*args, **kwargs)

Protocol defining the interface for model input contexts in token generation.

An InputContext represents model inputs for TokenGenerator instances, managing the state of tokens throughout the generation process. It handles token arrays, generation status, sampling parameters, and various indices that track different stages of token processing.

The context maintains a token array with the following layout:

                     +---------- full prompt ----------+   CHUNK_SIZE*N v
+--------------------+---------------+-----------------+----------------+
|     completed      |  next_tokens  |                 |  preallocated  |
+--------------------+---------------+-----------------+----------------+
           start_idx ^    active_idx ^         end_idx ^

Token Array Regions:
  • completed: Tokens that have already been processed and encoded.
  • next_tokens: Tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
  • preallocated: Token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate new tokens.

Key Indices:
  • start_idx: Marks the beginning of completed tokens.
  • active_idx: Marks the start of next_tokens within the array.
  • end_idx: Marks the end of all active tokens (one past the last token).
  • committed_idx: Marks tokens that have been committed and returned to the user.

    active_idx

    property active_idx: int

    The index marking the start of next_tokens within the token array.

    This index separates completed tokens from tokens that will be processed in the next iteration during chunked prefill or generation.

    Returns:

    The zero-based index where next_tokens begin in the token array.

    active_length

    property active_length: int

    The number of tokens being processed in the current iteration.

    During context encoding (prompt processing), this equals the prompt size or chunk size for chunked prefill. During token generation, this is typically 1 (one new token per iteration).

    Returns:

    The number of tokens being processed in this iteration.

    all_tokens

    property all_tokens: ndarray

    All active tokens in the context (prompt and generated).

    This property returns only the meaningful tokens, excluding any preallocated but unused slots in the token array.

    Returns:

    A 1D NumPy array containing all prompt and generated tokens.

    bump_token_indices()

    bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

    Increment token indices by the specified amounts.

    This method provides fine-grained control over token index management, allowing incremental updates to track token processing progress.

    Parameters:

    • start_idx (int) – Amount to increment the start_idx by.
    • active_idx (int) – Amount to increment the active_idx by.
    • end_idx (int) – Amount to increment the end_idx by.
    • committed_idx (int) – Amount to increment the committed_idx by.

    Return type:

    None

    committed_idx

    property committed_idx: int

    The index marking tokens that have been committed and returned to the user.

    Committed tokens are those that have been finalized in the generation process and delivered as output to the user.

    Returns:

    The zero-based index up to which tokens have been committed.

    compute_num_available_steps()

    compute_num_available_steps(max_seq_len)

    Compute the maximum number of generation steps available.

    This method calculates how many tokens can be generated without exceeding the specified maximum sequence length limit.

    Parameters:

    max_seq_len (int) – The maximum allowed sequence length for this context.

    Returns:

    The number of generation steps that can be executed before reaching the sequence length limit.

    Return type:

    int

    current_length

    property current_length: int

    The current total length of the sequence.

    This includes both completed tokens and tokens currently being processed, representing the total number of tokens in the active sequence.

    Returns:

    The total number of tokens including completed and active tokens.

    end_idx

    property end_idx: int

    The index marking the end of all active tokens in the token array.

    This is an exclusive end index (one past the last active token), following Python’s standard slicing conventions.

    Returns:

    The zero-based index one position past the last active token.

    eos_token_ids

    property eos_token_ids: set[int]

    The set of end-of-sequence token IDs that can terminate generation.

    Returns:

    A set of token IDs that, when generated, will signal the end of the sequence and terminate the generation process.

    generated_tokens

    property generated_tokens: ndarray

    All tokens generated by the model for this context.

    This excludes the original prompt tokens and includes only tokens that have been produced during the generation process.

    Returns:

    A 1D NumPy array containing generated token IDs.

    get_min_token_logit_mask()

    get_min_token_logit_mask(num_steps)

    Get token indices that should be masked in the output logits.

    This method is primarily used to implement the min_tokens constraint, where certain tokens (typically EOS tokens) are masked to prevent early termination before the minimum token count is reached.

    Parameters:

    num_steps (int) – The number of generation steps to compute masks for.

    Returns:

    A list of NumPy arrays, where each array contains token indices that should be masked (set to negative infinity) in the logits for the corresponding generation step.

    Return type:

    list[ndarray[Any, dtype[int32]]]

    is_ce

    property is_ce: bool

    Whether this context is in context encoding (CE) mode.

    Context encoding mode indicates that the context is processing input tokens (prompt) rather than generating new tokens.

    Returns:

    True if this is a context encoding context, False if it’s in token generation mode.

    is_done

    property is_done: bool

    Whether the generation process for this context has completed.

    Returns:

    True if generation has finished successfully or been terminated, False if generation is still in progress.

    is_initial_prompt

    property is_initial_prompt: bool

    Whether this context contains only the initial prompt.

    This property indicates if the context has not yet been updated with any generated tokens and still contains only the original input.

    Returns:

    True if no tokens have been generated yet, False if generation has begun and tokens have been added.

    json_schema

    property json_schema: str | None

    The JSON schema for constrained decoding, if configured.

    When set, this schema constrains token generation to produce valid JSON output that conforms to the specified structure.

    Returns:

    The JSON schema string, or None if no schema constraint is active.

    jump_ahead()

    jump_ahead(new_token)

    Jump ahead in generation by adding a token and updating indices.

    This method is used in speculative decoding scenarios to quickly advance the generation state when draft tokens are accepted.

    Parameters:

    new_token (int) – The token ID to add when jumping ahead in the sequence.

    Return type:

    None

    log_probabilities

    property log_probabilities: int

    The number of top tokens to return log probabilities for.

    When greater than 0, the system returns log probabilities for the top N most likely tokens at each generation step.

    Returns:

    The number of top tokens to include in log probability output. Returns 0 if log probabilities are disabled.

    log_probabilities_echo

    property log_probabilities_echo: bool

    Whether to include input tokens in the returned log probabilities.

    When True, log probabilities will be computed and returned for input (prompt) tokens in addition to generated tokens.

    Returns:

    True if input tokens should be included in log probability output, False otherwise.

    matcher

    property matcher: Any | None

    The grammar matcher for structured output generation, if configured.

    The matcher enforces structural constraints (like JSON schema) during generation to ensure valid formatted output.

    Returns:

    The grammar matcher instance, or None if no structured generation is configured for this context.

    max_length

    property max_length: int | None

    The maximum allowed length for this sequence.

    When set, generation will stop when this length is reached, regardless of other stopping criteria.

    Returns:

    The maximum sequence length limit, or None if no limit is set.

    min_tokens

    property min_tokens: int

    The minimum number of new tokens that must be generated.

    Generation will continue until at least this many new tokens have been produced, even if other stopping criteria are met (e.g., EOS tokens).

    Returns:

    The minimum number of new tokens to generate.

    next_tokens

    property next_tokens: ndarray

    The tokens to be processed in the next model iteration.

    This array contains the tokens that will be fed to the model in the upcoming forward pass. The length should match active_length.

    Returns:

    A 1D NumPy array of token IDs with length equal to active_length.

    outstanding_completion_tokens()

    outstanding_completion_tokens()

    Get completion tokens that are ready to be returned to the user.

    This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.

    Returns:

    A list of tuples, where each tuple contains a token ID and its associated log probabilities (or None if log probabilities are not enabled).

    Return type:

    list[tuple[int, LogProbabilities | None]]

    prompt_tokens

    property prompt_tokens: ndarray

    The original prompt tokens for this context.

    These are the input tokens that were provided to start the generation process, before any tokens were generated by the model.

    Returns:

    A 1D NumPy array containing the original prompt token IDs.

    request_id

    property request_id: str

    The unique identifier for this generation request.

    Returns:

    A RequestID that uniquely identifies this request across the system.

    reset()

    reset()

    Reset the context state by consolidating all tokens into a new prompt.

    This method is typically used when a request is evicted from cache, requiring the context to be re-encoded in a subsequent context encoding iteration. All generated tokens become part of the new prompt.

    Return type:

    None

    rollback()

    rollback(idx)

    Rollback generation by removing the specified number of tokens.

    This method is used to undo recent generation steps, typically when implementing techniques like beam search or when handling generation errors that require backtracking.

    Parameters:

    idx (int) – The number of tokens to remove from the end of the sequence.

    Return type:

    None

    sampling_params

    property sampling_params: SamplingParams

    The sampling parameters configured for this generation request.

    These parameters control how tokens are selected during generation, including temperature, top-k/top-p filtering, and stopping criteria.

    Returns:

    The SamplingParams instance containing all sampling configuration for this context.

    set_draft_offset()

    set_draft_offset(idx)

    Set the draft token offset for speculative decoding optimization.

    This method configures the offset used in speculative decoding, where draft tokens are generated speculatively to improve generation throughput.

    Parameters:

    idx (int) – The offset index for draft tokens in the speculative decoding process.

    Return type:

    None

    set_matcher()

    set_matcher(matcher)

    Set a grammar matcher for constrained decoding.

    This method configures structured output generation by installing a grammar matcher that enforces format constraints during token generation.

    Parameters:

    matcher (Any) – The grammar matcher instance to use for constraining output. The specific type depends on the structured generation backend.

    Return type:

    None

    set_token_indices()

    set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

    Set token indices to specific absolute values.

    This method provides direct control over token index positioning, allowing precise management of the token array state.

    Parameters:

    • start_idx (int | None) – New absolute value for start_idx, if provided.
    • active_idx (int | None) – New absolute value for active_idx, if provided.
    • end_idx (int | None) – New absolute value for end_idx, if provided.
    • committed_idx (int | None) – New absolute value for committed_idx, if provided.

    Return type:

    None

    start_idx

    property start_idx: int

    The index marking the start of completed tokens in the token array.

    Completed tokens are those that have already been processed and encoded by the model in previous iterations.

    Returns:

    The zero-based index where completed tokens begin in the token array.

    status

    property status: GenerationStatus

    The current generation status of this context.

    Returns:

    The GenerationStatus indicating the current state of generation (e.g., encoding, generating, completed, or error).

    tokens

    property tokens: ndarray

    The complete token array including preallocated slots.

    This includes all tokens (completed, active, and preallocated empty slots). For most use cases, prefer all_tokens to get only the active tokens.

    Returns:

    A 1D NumPy array containing all tokens including padding.

    update()

    update(new_token, log_probabilities=None)

    Update the context with a newly generated token.

    This method adds a generated token to the context, updating the token array and associated metadata. It also stores log probability information if provided.

    Parameters:

    • new_token (int) – The token ID to add to the generation sequence.
    • log_probabilities (LogProbabilities | None) – Optional log probability data for the new token and alternatives. Used for analysis and debugging.

    Return type:

    None

    update_status()

    update_status(status)

    Update the current generation status of this context.

    This method transitions the context to a new generation state, such as moving from encoding to generating or marking completion.

    Parameters:

    status (GenerationStatus) – The new GenerationStatus to assign to this context.

    Return type:

    None

    LogProbabilities

    class max.interfaces.LogProbabilities(token_log_probabilities, top_log_probabilities)

    Log probabilities for an individual output token.

    This is a data-only class that serves as a serializable data structure for transferring log probability information. It does not provide any functionality for calculating or manipulating log probabilities - it is purely for data storage and serialization purposes.

    Parameters:

    token_log_probabilities

    token_log_probabilities: list[float]

    Log probabilities of each token.

    top_log_probabilities

    top_log_probabilities: list[dict[int, float]]

    Top tokens and their corresponding log probabilities.
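
    A small construction sketch with made-up values, matching the two documented fields:

    from max.interfaces import LogProbabilities

    log_probs = LogProbabilities(
        token_log_probabilities=[-0.12, -1.30],            # one entry per generated token
        top_log_probabilities=[{42: -0.12}, {99: -1.30}],  # top alternatives per step
    )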

    MAXQueue

    class max.interfaces.MAXQueue(*args, **kwargs)

    Protocol for a minimal, non-blocking queue interface in MAX.

    This protocol defines the minimal contract for a queue that supports non-blocking put and get operations. It is generic over the item type.

    get_nowait()

    get_nowait()

    Remove and return an item from the queue without blocking.

    This method is expected to raise queue.Empty if no item is available to retrieve from the queue.

    Returns:

    The item removed from the queue.

    Return type:

    ItemType

    Raises:

    queue.Empty – If the queue is empty and no item can be retrieved.

    put_nowait()

    put_nowait(item)

    Attempt to put an item into the queue without blocking.

    This method is designed to immediately fail (typically by raising an exception) if the item cannot be added to the queue at the time of the call. Unlike the traditional ‘put’ method in many queue implementations—which may block until space becomes available or the transfer is completed—this method never waits. It is intended for use cases where the caller must be notified of failure to enqueue immediately, rather than waiting for space.

    Parameters:

    item (ItemType) – The item to be added to the queue.

    Return type:

    None
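
    Because the protocol only requires put_nowait() and get_nowait(), the standard-library queue.Queue already satisfies it structurally; a minimal sketch:

    import queue

    q: "queue.Queue[int]" = queue.Queue(maxsize=2)
    q.put_nowait(1)  # raises queue.Full if the item cannot be enqueued
    try:
        item = q.get_nowait()
    except queue.Empty:
        item = None  # nothing available right now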

    Pipeline

    class max.interfaces.Pipeline

    Abstract base class for pipeline operations.

    This generic abstract class defines the interface for pipeline operations that transform inputs of type PipelineInputsType into outputs of type PipelineOutputsDict[PipelineOutputType]. All concrete pipeline implementations must inherit from this class and implement the execute method.

    Type Parameters:
      • PipelineInputsType: The type of inputs this pipeline accepts; must inherit from PipelineInputs.
      • PipelineOutputType: The type of outputs this pipeline produces; must be a subclass of PipelineOutput.

    class MyPipeline(Pipeline[MyInputs, MyOutput]):
        def execute(self, inputs: MyInputs) -> dict[RequestID, MyOutput]:
            # Implementation here
            pass

    execute()

    abstract execute(inputs)

    Execute the pipeline operation with the given inputs.

    This method must be implemented by all concrete pipeline classes. It takes inputs of the specified type and returns outputs according to the pipeline’s processing logic.

    Parameters:

    inputs (PipelineInputsType) – The input data for the pipeline operation, must be of type PipelineInputsType

    Returns:

    The results of the pipeline operation, as a dictionary mapping RequestID to PipelineOutputType

    Raises:

    NotImplementedError – If not implemented by a concrete subclass

    Return type:

    dict[str, PipelineOutputType]

    release()

    abstract release(request_id)

    Release any resources or state associated with a specific request.

    This method should be implemented by concrete pipeline classes to perform cleanup or resource deallocation for the given request ID. It is typically called when a request has completed processing and its associated resources (such as memory, cache, or temporary files) are no longer needed.

    Parameters:

    request_id (RequestID) – The unique identifier of the request to release resources for.

    Returns:

    None

    Raises:

    NotImplementedError – If not implemented by a concrete subclass.

    Return type:

    None

    PipelineTask

    class max.interfaces.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

    Enum representing the types of pipeline tasks supported.

    AUDIO_GENERATION

    AUDIO_GENERATION = 'audio_generation'

    Task for generating audio.

    EMBEDDINGS_GENERATION

    EMBEDDINGS_GENERATION = 'embeddings_generation'

    Task for generating embeddings.

    SPEECH_TOKEN_GENERATION

    SPEECH_TOKEN_GENERATION = 'speech_token_generation'

    Task for generating speech tokens.

    TEXT_GENERATION

    TEXT_GENERATION = 'text_generation'

    Task for generating text.

    output_type

    property output_type: type

    Get the output type for the pipeline task.

    Returns:

    The output type for the pipeline task.

    Return type:

    type

    PipelineTokenizer

    class max.interfaces.PipelineTokenizer(*args, **kwargs)

    Interface for LLM tokenizers.

    decode()

    async decode(encoded, **kwargs)

    Decodes response tokens to text.

    Parameters:

    encoded (TokenizerEncoded) – Encoded response tokens.

    Returns:

    Un-encoded response text.

    Return type:

    str

    encode()

    async encode(prompt, add_special_tokens)

    Encodes text prompts as tokens.

    Parameters:

    • prompt (str) – Un-encoded prompt text.
    • add_special_tokens (bool)

    Raises:

    ValueError – If the prompt exceeds the configured maximum length.

    Return type:

    TokenizerEncoded

    eos

    property eos: int

    The end of sequence token for this tokenizer.

    expects_content_wrapping

    property expects_content_wrapping: bool

    If true, this tokenizer expects messages to have a content property.

    Text messages are formatted as:

    { "type": "text", "content": "text content" }
    { "type": "text", "content": "text content" }

    instead of the OpenAI spec:

    { "type": "text", "text": "text content" }
    { "type": "text", "text": "text content" }

    NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:

    { "type": "image" }
    { "type": "image" }

    Their content is provided as byte arrays through the top-level property on the request object, i.e., RequestType.images.

    new_context()

    async new_context(request)

    Creates a new context from a request object. This is sent to the worker process once and then cached locally.

    Parameters:

    request (RequestType) – Incoming request.

    Returns:

    Initialized context.

    Return type:

    UnboundContextType
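
    A hedged usage sketch of the async encode/decode pair, assuming a concrete tokenizer instance is available:

    import asyncio

    async def round_trip(tokenizer, text: str) -> str:
        # Encode the prompt, then decode the tokens back to text.
        encoded = await tokenizer.encode(text, add_special_tokens=True)
        return await tokenizer.decode(encoded)

    # asyncio.run(round_trip(my_tokenizer, "Hello"))  # my_tokenizer is hypothetical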

    Request

    class max.interfaces.Request(request_id)

    Base class representing a generic request within the MAX API.

    This class provides a unique identifier for each request, ensuring that all requests can be tracked and referenced consistently throughout the system. Subclasses can extend this class to include additional fields specific to their request types.

    Parameters:

    request_id (str)

    request_id

    request_id: str

    RequestID

    max.interfaces.RequestID

    alias of str

    SamplingParams

    class max.interfaces.SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

    Request specific sampling parameters that are only known at run time.

    Parameters:

    detokenize

    detokenize: bool = True

    Whether to detokenize the output tokens into text.

    frequency_penalty

    frequency_penalty: float = 0.0

    The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text: tokens will receive a penalty proportional to the count of appearances.

    ignore_eos

    ignore_eos: bool = False

    If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.

    max_new_tokens

    max_new_tokens: int | None = None

    The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

    min_new_tokens

    min_new_tokens: int = 0

    The minimum number of tokens to generate in the response.

    min_p

    min_p: float = 0.0

    Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

    presence_penalty

    presence_penalty: float = 0.0

    The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once by applying a constant penalty.

    repetition_penalty

    repetition_penalty: float = 1.0

    The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once by dividing the logits by the repetition penalty.

    seed

    seed: int = 0

    The seed to use for the random number generator.

    stop

    stop: list[str] | None = None

    A list of detokenized sequences that can be used as stop criteria when generating a new sequence.

    stop_token_ids

    stop_token_ids: list[int] | None = None

    A list of token ids that are used as stopping criteria when generating a new sequence.

    temperature

    temperature: float = 1

    Controls the randomness of the model’s output; higher values produce more diverse responses.

    top_k

    top_k: int = 1

    Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

    top_p

    top_p: float = 1

    Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
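
    A construction sketch using only the documented fields; the values are arbitrary:

    from max.interfaces import SamplingParams

    params = SamplingParams(
        top_k=40,
        top_p=0.95,
        temperature=0.7,
        max_new_tokens=256,
        stop=["\n\n"],  # stop when a blank line is generated
        seed=42,
    )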

    SchedulerResult

    class max.interfaces.SchedulerResult(status, result)

    Structure representing the result of a scheduler operation for a specific pipeline execution.

    This class encapsulates the outcome of a pipeline operation as managed by the scheduler, including both the execution status and any resulting data from the pipeline. The scheduler uses this structure to communicate the state of pipeline operations back to clients, whether the operation is still running, has completed successfully, or was cancelled.

    The generic type parameter allows this result to work with different types of pipeline outputs while maintaining type safety.

    Parameters:

    active()

    classmethod active(result)

    Create a SchedulerResult representing an active pipeline operation.

    Parameters:

    result (PipelineOutputType) – The current pipeline output data (may be partial for streaming operations).

    Returns:

    A SchedulerResult with ACTIVE status and the provided result.

    Return type:

    SchedulerResult

    cancelled()

    classmethod cancelled()

    Create a SchedulerResult representing a cancelled pipeline operation.

    Returns:

    A SchedulerResult with CANCELLED status and no result.

    Return type:

    SchedulerResult

    complete()

    classmethod complete(result)

    Create a SchedulerResult representing a completed pipeline operation.

    Parameters:

    result (PipelineOutputType) – The final pipeline output data.

    Returns:

    A SchedulerResult with COMPLETE status and the final result.

    Return type:

    SchedulerResult

    result

    result: PipelineOutputType | None

    The pipeline output data, if any. May be None for cancelled operations or during intermediate states of streaming operations.

    status

    status: SchedulerStatus

    The current status of the pipeline operation from the scheduler’s perspective.

    stop_stream

    property stop_stream: bool

    Determine if the pipeline operation stream should continue based on the current status.

    Returns:

    True if the pipeline operation stream should stop (CANCELLED or COMPLETE), False if it should continue (ACTIVE).

    Return type:

    bool
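
    A sketch of the classmethod constructors and the stop_stream helper, using plain strings as a stand-in for the pipeline output type:

    from max.interfaces import SchedulerResult

    partial = SchedulerResult.active("Hello")        # streaming, more data to come
    final = SchedulerResult.complete("Hello world")  # terminal result

    assert not partial.stop_stream
    assert final.stop_stream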

    SchedulerStatus

    class max.interfaces.SchedulerStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

    Represents the status of a scheduler operation for a specific pipeline execution.

    The scheduler manages the execution of pipeline operations and returns status updates to indicate the current state of the pipeline execution. This enum defines the possible states that a pipeline operation can be in from the scheduler’s perspective.

    ACTIVE

    ACTIVE = 'active'

    Indicates that the scheduler executed the pipeline operation successfully and the request remains active.

    CANCELLED

    CANCELLED = 'cancelled'

    Indicates that the pipeline operation was cancelled before completion; no further data will be provided.

    COMPLETE

    COMPLETE = 'complete'

    Indicates that the pipeline operation has already finished and no further data will be streamed.

    SharedMemoryArray

    class max.interfaces.SharedMemoryArray(name, shape, dtype)

    Wrapper for numpy array stored in shared memory.

    This class is used as a placeholder in pixel_values during serialization. It will be encoded as a dict with __shm__ flag and decoded back to a numpy array.

    Parameters:

    TextGenerationInputs

    class max.interfaces.TextGenerationInputs(batch, num_steps)

    Input parameters for text generation pipeline operations.

    This class encapsulates the batch of contexts and number of steps required for token generation in a single input object, replacing the previous pattern of passing batch and num_steps as separate parameters.

    Parameters:

    • batch (dict[str, TextGenerationContextType])
    • num_steps (int)

    batch

    batch: dict[str, TextGenerationContextType]

    Dictionary mapping request IDs to context objects.

    num_steps

    num_steps: int

    Number of tokens to generate.

    TextGenerationOutput

    class max.interfaces.TextGenerationOutput(request_id, tokens, final_status, log_probabilities=None)

    Represents the output of a text generation operation, combining token IDs, final generation status, request ID, and optional log probabilities for each token.

    Parameters:

    final_status

    final_status: GenerationStatus

    The final status of the generation process.

    is_done

    property is_done: bool

    Indicates whether the text generation process is complete.

    Returns:

    True if the generation is done, False otherwise.

    Return type:

    bool

    log_probabilities

    log_probabilities: list[LogProbabilities] | None

    Optional list of log probabilities for each token.

    request_id

    request_id: str

    The unique identifier for the generation request.

    tokens

    tokens: list[int]

    List of generated token IDs.
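
    A minimal construction sketch following the signature above; the token IDs are placeholders:

    from max.interfaces import GenerationStatus, TextGenerationOutput

    output = TextGenerationOutput(
        request_id="req-42",           # placeholder ID
        tokens=[101, 2023, 102],       # placeholder token IDs
        final_status=GenerationStatus.END_OF_SEQUENCE,
    )
    assert output.is_done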

    TextGenerationRequest

    class max.interfaces.TextGenerationRequest(request_id: str, index: 'int', model_name: 'str', lora_name: 'str | None' = None, prompt: 'Union[str, Sequence[int], None]' = None, messages: 'Optional[list[TextGenerationRequestMessage]]' = None, images: 'Optional[list[bytes]]' = None, tools: 'Optional[list[TextGenerationRequestTool]]' = None, response_format: 'Optional[TextGenerationResponseFormat]' = None, timestamp_ns: 'int' = 0, request_path: 'str' = '/', logprobs: 'int' = 0, echo: 'bool' = False, stop: 'Optional[Union[str, list[str]]]' = None, chat_template_options: 'Optional[dict[str, Any]]' = None, sampling_params: 'SamplingParams' = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0))

    Parameters:

    chat_template_options

    chat_template_options: dict[str, Any] | None = None

    Optional dictionary of options to pass when applying the chat template.

    echo

    echo: bool = False

    If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.

    images

    images: list[bytes] | None = None

    A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.

    index

    index: int

    The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

    logprobs

    logprobs: int = 0

    The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.

    lora_name

    lora_name: str | None = None

    The name of the LoRA to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

    messages

    messages: list[TextGenerationRequestMessage] | None = None

    A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.

    model_name

    model_name: str

    The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

    prompt

    prompt: str | Sequence[int] | None = None

    The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.

    request_path

    request_path: str = '/'

    The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.

    response_format

    response_format: TextGenerationResponseFormat | None = None

    Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.

    sampling_params

    sampling_params: SamplingParams = SamplingParams(top_k=1, top_p=1, min_p=0.0, temperature=1, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_new_tokens=None, min_new_tokens=0, ignore_eos=False, stop=None, stop_token_ids=None, detokenize=True, seed=0)

    Token sampling configuration parameters for the request.

    stop

    stop: str | list[str] | None = None

    Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).

    timestamp_ns

    timestamp_ns: int = 0

    The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.

    tools

    tools: list[TextGenerationRequestTool] | None = None

    A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
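
    A construction sketch for a simple completion-style request; the model name and prompt are placeholders, and only documented fields are used:

    from max.interfaces import SamplingParams, TextGenerationRequest

    request = TextGenerationRequest(
        request_id="req-42",        # placeholder ID
        index=0,
        model_name="example-llm",   # placeholder model name
        prompt="Write a haiku about GPUs.",
        sampling_params=SamplingParams(temperature=0.7, max_new_tokens=64),
    )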

    TextGenerationRequestFunction

    class max.interfaces.TextGenerationRequestFunction

    Represents a function definition for a text generation request.

    description

    description: str

    A human-readable description of the function’s purpose.

    name

    name: str

    The name of the function to be invoked.

    parameters

    parameters: dict

    A dictionary describing the function’s parameters, typically following a JSON schema.

    TextGenerationRequestMessage

    class max.interfaces.TextGenerationRequestMessage

    content

    content: str | list[dict[str, Any]]

    Content can be a simple string or a list of message parts of different modalities.

    For example:

    {
      "role": "user",
      "content": "What's the weather like in Boston today?"
    }

    Or:

    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        }
      ]
    }

    role

    role: Literal['system', 'user', 'assistant']

    The role of the message sender, indicating whether the message is from the system, user, or assistant.

    TextGenerationRequestTool

    class max.interfaces.TextGenerationRequestTool

    Represents a tool definition for a text generation request.

    function

    function: TextGenerationRequestFunction

    The function definition associated with the tool, including its name, description, and parameters.

    type

    type: str

    The type of the tool, typically indicating the tool’s category or usage.

    TextGenerationResponseFormat

    class max.interfaces.TextGenerationResponseFormat

    Represents the response format specification for a text generation request.

    json_schema

    json_schema: dict

    A JSON schema dictionary that defines the structure and validation rules for the generated response.

    type

    type: str

    The type of response format, e.g., “json_object”.