
Python module

context

EmbeddingsGenerator

class max.pipelines.core.EmbeddingsGenerator(*args, **kwargs)

Interface for LLM embeddings-generator models.

encode()

encode(batch: dict[str, EmbeddingsGeneratorContext]) → dict[str, Any]

Computes embeddings for a batch of inputs.

  • Parameters:

    batch (dict[str, EmbeddingsGeneratorContext]) – Batch of contexts to generate embeddings for.

  • Returns:

    Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values.

  • Return type:

    dict[str, Any]
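
A minimal usage sketch: the embedder below is a hypothetical object implementing this interface, and the contexts are assumed to have been built elsewhere (for example by the pipeline's tokenizer).

# Hypothetical embedder implementing EmbeddingsGenerator; the contexts are
# assumed to exist already, keyed by request ID.
batch = {
    "req-0": context_a,  # EmbeddingsGeneratorContext (assumed)
    "req-1": context_b,
}

results = embedder.encode(batch)  # dict[str, Any], keyed by request ID
for request_id, embedding in results.items():
    # Each embedding is typically a numpy array or tensor of floats.
    print(request_id, getattr(embedding, "shape", None))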

EmbeddingsResponse

class max.pipelines.core.EmbeddingsResponse(embeddings: ndarray)

Container for the response from an embeddings pipeline.

embeddings

embeddings*: ndarray*

InputContext

class max.pipelines.core.InputContext(*args, **kwargs)

A base class for model contexts, representing model inputs for TokenGenerators.

Token array layout:

             +---------- full prompt ----------+   CHUNK_SIZE*N v
 +--------------------+---------------+-----------------+----------------+
 |      completed     |  next_tokens  |                 |  preallocated  |
 +--------------------+---------------+-----------------+----------------+
   start_idx ^          active_idx ^           end_idx ^

  • completed: The tokens that have already been processed and encoded.
  • next_tokens: The tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill.
  • preallocated: The token slots that have been preallocated. The token array resizes to multiples of CHUNK_SIZE to accommodate new tokens. (See the index sketch after this list.)
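
The sketch below is a plain-Python illustration of this index bookkeeping with made-up values. It is not the library's internal implementation, and the assumption that active_length equals active_idx - start_idx is inferred from the descriptions above.

import numpy as np

# Made-up values for illustration only; the library manages these indices itself.
CHUNK_SIZE = 128
tokens = np.zeros(2 * CHUNK_SIZE, dtype=np.int64)  # resized in CHUNK_SIZE multiples

start_idx = 96     # tokens[:start_idx] are completed (already encoded)
active_idx = 160   # tokens[start_idx:active_idx] are next_tokens for this step
end_idx = 160      # slots past end_idx are preallocated and unused

active_length = active_idx - start_idx  # tokens input this iteration
print(active_length)  # 64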

active_idx

property active_idx*: int*

active_length

property active_length*: int*

The number of tokens input this iteration (the current sequence length for this step).

This will be the prompt size for context encoding, and simply 1 for token generation.

assign_to_cache()

assign_to_cache(cache_seq_id: int) → None

Assigns the context to a cache slot.

bump_token_indices()

bump_token_indices(start_idx: int = 0, active_idx: int = 0, end_idx: int = 0, committed_idx: int = 0) → None

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

cache_seq_id

property cache_seq_id*: int*

Returns the cache slot assigned to the context, raising an error if not assigned.

committed_idx

property committed_idx*: int*

compute_num_available_steps()

compute_num_available_steps(max_seq_len: int) → int

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.
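
A hedged usage sketch: the returned value can cap how many decoding steps a scheduler runs for this context. Here ctx and requested_steps are assumed to exist, and max_seq_len would come from the model configuration.

# Cap the scheduled steps so the sequence never exceeds the model's limit.
max_seq_len = 4096       # assumed model limit
requested_steps = 8      # steps the scheduler would like to run

num_steps = min(requested_steps, ctx.compute_num_available_steps(max_seq_len))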

current_length

property current_length*: int*

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx*: int*

ignore_eos

property ignore_eos*: bool*

is_assigned_to_cache

property is_assigned_to_cache*: bool*

Returns True if input is assigned to a cache slot, False otherwise.

json_schema

property json_schema*: str | None*

A json schema to use during constrained decoding.

jump_ahead()

jump_ahead(new_token: int, is_eos: bool = False) → None

Updates the token array, while ensuring the new token is returned to the user.

log_probabilities

property log_probabilities*: int*

When > 0, returns the log probabilities for the top N tokens for each token in the sequence.

log_probabilities_echo

property log_probabilities_echo*: bool*

When True, the input tokens are added to the returned logprobs.

matcher

property matcher*: xgr.GrammarMatcher | None*

An optional xgr Grammar Matcher provided when using structured output.

max_length

property max_length*: int | None*

The maximum length of this sequence.

next_tokens

property next_tokens*: ndarray*

The next prompt tokens to be input during this iteration.

This should be a 1D array of tokens of length active_length.

outstanding_completion_tokens()

outstanding_completion_tokens() → list[tuple[int, Optional[max.pipelines.core.interfaces.response.LogProbabilities]]]

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

reset()

reset() → None

Resets the context’s state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning that the context needs to be re-encoded in the following context-encoding (CE) iteration.

rollback()

rollback(idx: int) → None

Rolls back the context, removing the last idx tokens.

set_draft_offset()

set_draft_offset(idx: int) → None

set_matcher()

set_matcher(matcher: xgr.GrammarMatcher) → None

Set a grammar matcher for use during constrained decoding.

set_token_indices()

set_token_indices(start_idx: int | None = None, active_idx: int | None = None, end_idx: int | None = None, committed_idx: int | None = None) → None

Set the token indices without manipulating the token array.

start_idx

property start_idx*: int*

tokens

property tokens*: ndarray*

All tokens in the context.

unassign_from_cache()

unassign_from_cache() → None

Unassigns the context from a cache slot.

update()

update(new_token: int, log_probabilities: LogProbabilities | None = None, is_eos: bool = False) → None

Updates the next_tokens and extends existing tokens to include all generated tokens.

LogProbabilities

class max.pipelines.core.LogProbabilities(token_log_probabilities: list[float], top_log_probabilities: list[dict[int, float]])

Log probabilities for an individual output token.

token_log_probabilities

token_log_probabilities

Log probabilities of each token.

top_log_probabilities

top_log_probabilities

Top candidate tokens and their corresponding log probabilities.
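
A small constructed example (the values are made up): one log probability per generated token, paired with a dict of top candidate token IDs and their log probabilities.

from max.pipelines.core import LogProbabilities

# Two generated tokens; all numbers are illustrative only.
logprobs = LogProbabilities(
    token_log_probabilities=[-0.11, -1.73],
    top_log_probabilities=[
        {1042: -0.11, 88: -2.35},   # top candidates for the first token
        {511: -1.73, 2094: -1.90},  # top candidates for the second token
    ],
)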

PipelineTask

class max.pipelines.core.PipelineTask(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

EMBEDDINGS_GENERATION

EMBEDDINGS_GENERATION = 'embeddings_generation'

TEXT_GENERATION

TEXT_GENERATION = 'text_generation'

PipelineTokenizer

class max.pipelines.core.PipelineTokenizer(*args, **kwargs)

Interface for LLM tokenizers.

decode()

async decode(context: TokenGeneratorContext, encoded: TokenizerEncoded, **kwargs) → str

Decodes response tokens to text.

  • Parameters:

    • context (TokenGeneratorContext) – Current generation context.
    • encoded (TokenizerEncoded) – Encoded response tokens.
  • Returns:

    Un-encoded response text.

  • Return type:

    str

encode()

async encode(prompt: str, add_special_tokens: bool) → TokenizerEncoded

Encodes text prompts as tokens.

  • Parameters:

    • prompt (str) – Un-encoded prompt text.
    • add_special_tokens (bool) – Whether to add special tokens (such as BOS/EOS) during encoding.

  • Raises:

    ValueError – If the prompt exceeds the configured maximum length.
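
A minimal sketch of one way to satisfy this interface by wrapping a Hugging Face tokenizer. The wrapper class, its constructor arguments, and the error message are illustrative assumptions, not part of the library; new_context() and context handling are omitted.

from transformers import AutoTokenizer

class HFPipelineTokenizer:
    """Illustrative wrapper; new_context() is omitted for brevity."""

    def __init__(self, model_path: str, max_length: int | None = None):
        self._tokenizer = AutoTokenizer.from_pretrained(model_path)
        self._max_length = max_length

    @property
    def eos(self) -> int:
        return self._tokenizer.eos_token_id

    @property
    def expects_content_wrapping(self) -> bool:
        return False

    async def encode(self, prompt: str, add_special_tokens: bool):
        token_ids = self._tokenizer.encode(prompt, add_special_tokens=add_special_tokens)
        if self._max_length is not None and len(token_ids) > self._max_length:
            raise ValueError("prompt exceeds the configured maximum length")
        return token_ids

    async def decode(self, context, encoded, **kwargs) -> str:
        return self._tokenizer.decode(encoded, **kwargs)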

eos

property eos*: int*

The end of sequence token for this tokenizer.

expects_content_wrapping

property expects_content_wrapping*: bool*

If true, this tokenizer expects messages to have a content property.

Text messages are formatted as:

{ "type": "text", "content": "text content" }
{ "type": "text", "content": "text content" }

instead of the OpenAI spec:

{ "type": "text", "text": "text content" }
{ "type": "text", "text": "text content" }

NOTE: Multimodal messages omit the content property. Both image_urls and image content parts are converted to:

{ "type": "image" }
{ "type": "image" }

Their content is provided as byte arrays through the top-level property on the request object, i.e., TokenGeneratorRequest.images.
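
A small illustrative helper (not part of the library) showing the kind of rewrite this flag implies: OpenAI-style text parts gain a content property, and image parts collapse to bare placeholders because their bytes travel separately on the request.

def wrap_content_parts(parts: list[dict]) -> list[dict]:
    """Illustrative only: rewrite OpenAI-style message parts for a tokenizer
    that expects content-wrapped text and bare image placeholders."""
    wrapped = []
    for part in parts:
        if part.get("type") == "text":
            wrapped.append({"type": "text", "content": part["text"]})
        elif part.get("type") in ("image_url", "image"):
            # Image bytes are provided via TokenGeneratorRequest.images.
            wrapped.append({"type": "image"})
        else:
            wrapped.append(part)
    return wrapped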

new_context()

async new_context(request: TokenGeneratorRequest) → TokenGeneratorContext

Creates a new context from a request object. This is sent to the worker process once and then cached locally.

  • Parameters:

    request (TokenGeneratorRequest) – Incoming request.

  • Returns:

    Initialized context.

  • Return type:

    TokenGeneratorContext

TextAndVisionContext

class max.pipelines.core.TextAndVisionContext(cache_seq_id: int, prompt: str | Sequence[int], max_length: int | None, tokens: ndarray, pixel_values: Sequence[ndarray], extra_model_args: dict[str, Any], log_probabilities: int = 0, log_probabilities_echo: bool = False, json_schema: str | None = None, ignore_eos: bool = False)

A base class for model contexts, specifically for vision model variants.

update()

update(new_token: int, log_probabilities: LogProbabilities | None = None, is_eos: bool = False) → None

Updates the next_tokens and extends existing tokens to include all generated tokens.

TextContext

class max.pipelines.core.TextContext(prompt: str | Sequence[int], max_length: int | None, tokens: ndarray, cache_seq_id: int | None = None, log_probabilities: int = 0, log_probabilities_echo: bool = False, json_schema: str | None = None, ignore_eos: bool = False)

A base class for model contexts, specifically for text model variants.

active_idx

property active_idx*: int*

active_length

property active_length*: int*

The number of tokens input this iteration (the current sequence length for this step).

This will be the prompt size for context encoding, and simply 1 (or more) for token generation.

assign_to_cache()

assign_to_cache(cache_seq_id: int) → None

bump_token_indices()

bump_token_indices(start_idx: int = 0, active_idx: int = 0, end_idx: int = 0, committed_idx: int = 0) → None

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

cache_seq_id

property cache_seq_id*: int*

committed_idx

property committed_idx*: int*

compute_num_available_steps()

compute_num_available_steps(max_seq_len: int) → int

Compute the max number of steps we can execute for a given context without exceeding the max_seq_len.

current_length

property current_length*: int*

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx*: int*

is_assigned_to_cache

property is_assigned_to_cache*: bool*

jump_ahead()

jump_ahead(new_token: int, is_eos: bool = False) → None

Updates the token array, while ensuring the new token is returned to the user.

next_tokens

property next_tokens*: ndarray*

outstanding_completion_tokens()

outstanding_completion_tokens() → list[tuple[int, Optional[max.pipelines.core.interfaces.response.LogProbabilities]]]

Return the list of outstanding completion tokens and log probabilities that must be returned to the user.

reset()

reset() → None

Resets the context’s state by combining all tokens into a new prompt.

rollback()

rollback(idx: int) → None

set_draft_offset()

set_draft_offset(idx: int) → None

set_matcher()

set_matcher(matcher: xgr.GrammarMatcher) → None

set_token_indices()

set_token_indices(start_idx: int | None = None, active_idx: int | None = None, end_idx: int | None = None, committed_idx: int | None = None) → None

Set the token indices without manipulating the token array.

start_idx

property start_idx*: int*

tokens

property tokens*: ndarray*

unassign_from_cache()

unassign_from_cache() → None

update()

update(new_token: int, log_probabilities: LogProbabilities | None = None, is_eos: bool = False) → None

Updates the next_tokens and extends existing tokens to include all generated tokens.

TextGenerationResponse

class max.pipelines.core.TextGenerationResponse(tokens: list[max.pipelines.core.interfaces.response.TextResponse], final_status: TextGenerationStatus)

append_token()

append_token(token: TextResponse) → None

final_status

property final_status*: TextGenerationStatus*

is_done

property is_done*: bool*

tokens

property tokens*: list[max.pipelines.core.interfaces.response.TextResponse]*

update_status()

update_status(status: TextGenerationStatus) → None
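
A hedged consumption sketch, assuming response is a TextGenerationResponse obtained from a token generator: iterate the accumulated per-token responses, then inspect the final status.

# `response` is assumed to be a TextGenerationResponse returned elsewhere.
for text_response in response.tokens:
    print(text_response.next_token, text_response.log_probabilities)

if response.is_done:
    # final_status distinguishes END_OF_SEQUENCE from MAXIMUM_LENGTH.
    print("finished:", response.final_status)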

TextGenerationStatus

class max.pipelines.core.TextGenerationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

ACTIVE

ACTIVE = 'active'

END_OF_SEQUENCE

END_OF_SEQUENCE = 'end_of_sequence'

MAXIMUM_LENGTH

MAXIMUM_LENGTH = 'maximum_length'

is_done

property is_done*: bool*

TextResponse

class max.pipelines.core.TextResponse(next_token: int | str, log_probabilities: LogProbabilities | None = None)

A base class for model responses, specifically for text model variants.

next_token

next_token

Encoded predicted next token.

log_probabilities

log_probabilities

Log probabilities of each output token.

TokenGenerator

class max.pipelines.core.TokenGenerator(*args, **kwargs)

Interface for LLM token-generator models.

next_token()

next_token(batch: dict[str, TokenGeneratorContext], num_steps: int) → dict[str, max.pipelines.core.interfaces.response.TextGenerationResponse]

Computes the next token response for a single batch.

  • Parameters:

    • batch (dict[str, TokenGeneratorContext]) – Batch of contexts.
    • num_steps (int) – Number of tokens to generate.
  • Returns:

    Dictionary of responses, keyed by request ID.

  • Return type:

    dict[str, TextGenerationResponse]

release()

release(context: TokenGeneratorContext) → None

Releases resources associated with this context.

  • Parameters:

    context (TokenGeneratorContext) – Finished context.
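
A hedged sketch of a simple serving loop built on this interface. The generator and the contexts are assumed to exist, and emit() is a hypothetical sink for streaming tokens back to clients.

# Batch of active contexts keyed by request ID (assumed to exist).
batch = {"req-0": ctx0, "req-1": ctx1}

while batch:
    responses = generator.next_token(batch, num_steps=1)
    for request_id, response in responses.items():
        for text_response in response.tokens:
            emit(request_id, text_response.next_token)  # hypothetical sink
        if response.is_done:
            # Free cache and other resources tied to the finished context.
            generator.release(batch.pop(request_id))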

TokenGeneratorRequest

class max.pipelines.core.TokenGeneratorRequest(id: str, index: int, model_name: str, prompt: str | collections.abc.Sequence[int] | NoneType = None, messages: list[max.pipelines.core.interfaces.text_generation.TokenGeneratorRequestMessage] | None = None, images: list[bytes] | None = None, tools: list[max.pipelines.core.interfaces.text_generation.TokenGeneratorRequestTool] | None = None, response_format: max.pipelines.core.interfaces.text_generation.TokenGeneratorResponseFormat | None = None, max_new_tokens: int | None = None, timestamp_ns: int = 0, request_path: str = '/', logprobs: int = 0, echo: bool = False, stop: str | list[str] | NoneType = None, ignore_eos: bool = False)

echo

echo*: bool* = False

If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output.

id

id*: str*

A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking.

ignore_eos

ignore_eos*: bool* = False

If set to True, the response will ignore the EOS token and continue generating until the maximum token count is reached or a stop string is hit.

images

images*: list[bytes] | None* = None

A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task.

index

index*: int*

The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately.

logprobs

logprobs*: int* = 0

The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions.

max_new_tokens

max_new_tokens*: int | None* = None

The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria.

messages

messages*: list[max.pipelines.core.interfaces.text_generation.TokenGeneratorRequestMessage] | None* = None

A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages.

model_name

model_name*: str*

The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation.

prompt

prompt*: str | Sequence[int] | None* = None

The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field.

request_path

request_path*: str* = '/'

The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure.

response_format

response_format*: TokenGeneratorResponseFormat | None* = None

Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json_schema provided.

stop

stop*: str | list[str] | None* = None

Optional list of stop expressions (see: https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop).

timestamp_ns

timestamp_ns*: int* = 0

The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes.

tools

tools*: list[max.pipelines.core.interfaces.text_generation.TokenGeneratorRequestTool] | None* = None

A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses.
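
A constructed example (the ID, model name, and message content are placeholders); unset fields keep their defaults, and the message dict follows the TokenGeneratorRequestMessage shape.

from max.pipelines.core import TokenGeneratorRequest

request = TokenGeneratorRequest(
    id="request-123",             # placeholder identifier
    index=0,
    model_name="example-model",   # placeholder model name
    messages=[
        {"role": "user", "content": "Write a haiku about the ocean."},
    ],
    max_new_tokens=64,
    logprobs=2,                   # return the top-2 log probabilities per token
    stop=["\n\n"],
)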

TokenGeneratorRequestFunction

class max.pipelines.core.TokenGeneratorRequestFunction

description

description*: str*

name

name*: str*

parameters

parameters*: dict*

TokenGeneratorRequestMessage

class max.pipelines.core.TokenGeneratorRequestMessage

content

content*: str | list[dict[str, Any]]*

Content can be simple string or a list of message parts of different modalities.

For example:

{
  "role": "user",
  "content": "What's the weather like in Boston today?"
}

Or:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}

role

role*: Literal['system', 'user', 'assistant']*

TokenGeneratorRequestTool

class max.pipelines.core.TokenGeneratorRequestTool

function

function*: TokenGeneratorRequestFunction*

type

type*: str*
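
A constructed example of the tool shape these classes describe, written as plain dicts that match their fields; the function name and parameters are placeholders, not a real API.

# A TokenGeneratorRequestTool wrapping a TokenGeneratorRequestFunction.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Passed to the request via TokenGeneratorRequest(tools=[weather_tool], ...).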

TokenGeneratorResponseFormat

class max.pipelines.core.TokenGeneratorResponseFormat

json_schema

json_schema*: dict*

type

type*: str*