Python module

pipeline

MAX pipeline for model inference and generation (Text Generation variant).

BatchInfo

class max.pipelines.lib.pipeline_variants.text_generation.BatchInfo(past_seq_lens, seq_lens, num_steps)

Information about a batch of requests passed to the pipeline.

Parameters:

num_steps

num_steps: int

Number of steps to run in the pipeline.

past_seq_lens

past_seq_lens: list[int]

Coordinated list of past sequence lengths (i.e. context lengths), one entry per request.

seq_lens

seq_lens: list[int]

Coordinated list of current sequence lengths (i.e. prompt_len for a prefill request or 1 for a decode step).
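
For illustration, a BatchInfo describing a hypothetical batch of two requests, one in prefill (8 prompt tokens, no cached context) and one in decode (1 new token on top of 12 cached tokens), might look like this; all values are illustrative:

from max.pipelines.lib.pipeline_variants.text_generation import BatchInfo

info = BatchInfo(
    past_seq_lens=[0, 12],  # context lengths: empty for prefill, 12 cached tokens for decode
    seq_lens=[8, 1],        # prompt_len for the prefill request, 1 for the decode request
    num_steps=4,            # number of steps to run in the pipeline
)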

TextGenerationPipeline

class max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)

Generalized token generator pipeline.

execute()

execute(inputs)

Processes the batch and returns decoded tokens.

Given a batch, executes the graph for num_steps in a multi-step scenario, then decodes the tokens and returns the decoded output for each request.

Parameters:

inputs (TextGenerationInputs[TextGenerationContextType])

Return type:

dict[RequestID, TextGenerationOutput]
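
As a sketch of the calling convention, assuming a constructed pipeline and a TextGenerationInputs built by the caller (both assumed here, not shown):

def run_batch(pipeline, inputs):
    # Returns one TextGenerationOutput per request, keyed by RequestID.
    outputs = pipeline.execute(inputs)
    for request_id, generation in outputs.items():
        ...  # consume each request's decoded output
    return outputs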

initialize_bitmask()

initialize_bitmask(batch)

Allocates a per-request token bitmask for structured decoding.

Parameters:

batch (list[TextGenerationContextType]) – The generation contexts for the batch.

Returns:

A bitmask array of shape [batch_size, vocab_size] if structured output is enabled; otherwise None.

Return type:

ndarray[tuple[Any, …], dtype[int32]] | None
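
A minimal sketch of the shape contract, assuming a constructed pipeline and a list of generation contexts:

def allocate_bitmask(pipeline, batch):
    bitmask = pipeline.initialize_bitmask(batch)
    if bitmask is not None:
        # Structured output is enabled: one int32 row per request.
        assert bitmask.shape[0] == len(batch)
    return bitmask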

kv_managers

property kv_managers: list[Any]

Return the list of KV cache managers backing this pipeline.

pipeline_config

property pipeline_config: PipelineConfig

Return the pipeline configuration.

prepare_batch()

prepare_batch(batches, num_steps)

Prepare model inputs and ancillary state for multi-step execution.

This flattens replica batches, optionally initializes constrained decoding bitmasks, ensures KV-cache reservations, clamps num_steps per context, and builds initial model inputs.

Parameters:

  • batches (list[list[TextGenerationContextType]]) – Per-replica list of contexts.
  • num_steps (int) – Desired number of steps to run.

Returns:

  • ModelInputs: Prepared inputs for the first step.
  • int: The clamped number of steps to run.
  • Optional[np.ndarray]: The structured decoding bitmask or None.
  • list[TextGenerationContextType]: The flattened context batch.

Return type:

A tuple of (ModelInputs, int, Optional[np.ndarray], list[TextGenerationContextType]), in the order listed above.
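
A sketch of unpacking the return values, assuming the per-replica batches are built upstream; the step count of 8 is illustrative:

def prepare(pipeline, replica_batches):
    model_inputs, steps, bitmask, flat_batch = pipeline.prepare_batch(
        replica_batches,  # list[list[TextGenerationContextType]]
        num_steps=8,      # desired step count; the returned value may be clamped lower
    )
    return model_inputs, steps, bitmask, flat_batch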

release()

release(request_id)

Marks the context as complete, releasing the cache slot from the KV manager.

Note: KV cache lifecycle is now managed by the scheduler. This method is kept for interface compatibility but is a no-op for regular pipelines.

Parameters:

request_id (RequestID)

Return type:

None

tokenizer

property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]

Return the tokenizer used for building contexts and decoding.

update_for_structured_output()

update_for_structured_output(context, bitmask, index)

Updates the context and the logits bitmask for structured output.

If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context. It may also jump ahead in generation, and it fills the per-request token bitmask used to constrain the next-token distribution.

Parameters:

  • context (TextGenerationContextType) – Request context to update.
  • bitmask (ndarray[tuple[Any, ...], dtype[int32]]) – Optional preallocated bitmask buffer; updated in-place.
  • index (int) – Global position into the bitmask for this request.

Raises:

ValueError – If a JSON schema is provided but structured output is not enabled via sampling configuration.

Return type:

None
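
Combined with initialize_bitmask(), a sketch of the per-request update loop, assuming a constructed pipeline and a context batch:

def constrain_batch(pipeline, batch):
    bitmask = pipeline.initialize_bitmask(batch)
    if bitmask is not None:  # structured output enabled
        for index, context in enumerate(batch):
            # Fills row `index` of the bitmask in place for this request.
            pipeline.update_for_structured_output(context, bitmask, index)
    return bitmask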

TextGenerationPipelineInterface

class max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipelineInterface(*args, **kwargs)

Interface for text generation pipelines.

StandaloneSpeculativeDecodingPipeline

final class max.pipelines.lib.speculative_decoding.StandaloneSpeculativeDecodingPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, draft_pipeline_model=None, draft_weight_adapters=None)

Bases: SpeculativeDecodingPipelineBase

Standalone speculative decoding where draft model runs independently.

In this approach, the draft model generates tokens without any information from the target model, then the target model verifies these tokens.

execute()

execute(inputs)

Execute standalone speculative decoding.

In standalone mode:

  1. Draft model generates tokens independently
  2. Target model verifies draft tokens
  3. Rejection sampling accepts or rejects the draft tokens (see the sketch below)

Parameters:

inputs (TextGenerationInputs[TextContext])

Return type:

dict[RequestID, TextGenerationOutput]
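
Step 3 above refers to the standard speculative-decoding rejection rule: each draft token is accepted with probability min(1, p_target / p_draft), stopping at the first rejection. The following self-contained NumPy sketch illustrates that rule only; it is not this pipeline's internal implementation:

import numpy as np

def count_accepted(target_probs, draft_probs, rng):
    # Accept each draft token with probability min(1, p_target / p_draft);
    # stop at the first rejection.
    accepted = 0
    for p_t, p_d in zip(target_probs, draft_probs):
        if rng.random() < min(1.0, p_t / p_d):
            accepted += 1
        else:
            break
    return accepted

rng = np.random.default_rng(0)
n = count_accepted([0.9, 0.4, 0.1], [0.8, 0.5, 0.6], rng)  # probabilities are illustrative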

generate_draft_tokens()

generate_draft_tokens(batch, num_steps, model_inputs)

Generates draft tokens for the batch using the draft model.

Return type:

tuple[int, Buffer, Buffer, ModelInputs, Buffer]

prepare_batch()

prepare_batch(model, batch, replica_batches, num_steps, return_n_logits, is_draft=False, draft_inputs=None, merged_draft_tokens=None, merged_draft_offsets=None)

Prepares batch inputs and the KV cache for the draft or target model.

Return type:

tuple[ModelInputs, int]
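
A hedged sketch of a draft-side call, with every input assumed to be built upstream:

def prepare_draft(pipeline, draft_model, batch, replica_batches, num_steps, return_n_logits):
    model_inputs, clamped_steps = pipeline.prepare_batch(
        draft_model, batch, replica_batches, num_steps, return_n_logits,
        is_draft=True,
    )
    return model_inputs, clamped_steps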

verify_draft_tokens_with_target_model()

verify_draft_tokens_with_target_model(draft_inputs, context_batch, replica_batches, num_draft_tokens_generated, draft_tokens, draft_logits, merged_draft_tokens, merged_draft_offsets, all_draft_logits)

Verifies draft tokens against the target model and returns merged outputs.

Return type:

tuple[Buffer, Buffer, Buffer]
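
Reading the two signatures together suggests the draft-then-verify flow below. The correspondence between generate_draft_tokens()'s five return values and verify's parameters is an assumption based on the parameter names, not documented behavior, and the merged buffers are assumed to be allocated upstream:

def draft_then_verify(pipeline, batch, num_steps, model_inputs,
                      context_batch, replica_batches,
                      merged_draft_tokens, merged_draft_offsets):
    # Assumed mapping of the returned 5-tuple onto verify's parameters.
    (num_generated, draft_tokens, draft_logits,
     draft_inputs, all_draft_logits) = pipeline.generate_draft_tokens(
        batch, num_steps, model_inputs)
    return pipeline.verify_draft_tokens_with_target_model(
        draft_inputs, context_batch, replica_batches, num_generated,
        draft_tokens, draft_logits, merged_draft_tokens,
        merged_draft_offsets, all_draft_logits)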

EmbeddingsPipeline

final class max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)

Bases: Pipeline[EmbeddingsGenerationInputs, EmbeddingsGenerationOutput]

Generalized embeddings generation pipeline.

execute()

execute(inputs)

Processes the batch and returns embeddings.

Given a batch, executes the graph and returns the embedding output for each request.

Parameters:

inputs (EmbeddingsGenerationInputs)

Return type:

dict[RequestID, EmbeddingsGenerationOutput]
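
A sketch of the calling convention, assuming a constructed pipeline and an EmbeddingsGenerationInputs built by the caller:

def embed_batch(pipeline, inputs):
    outputs = pipeline.execute(inputs)
    for request_id, embedding in outputs.items():
        ...  # one EmbeddingsGenerationOutput per request
    return outputs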

release()

release(request_id)

Releases resources for the request (no-op for embeddings).

Parameters:

request_id (RequestID)

Return type:

None
