Python module
pipeline
MAX pipeline for model inference and generation (Text Generation variant).
BatchInfo
class max.pipelines.lib.pipeline_variants.text_generation.BatchInfo(past_seq_lens, seq_lens, num_steps)
Information about a batch of requests passed to the pipeline.
num_steps
num_steps: int
Number of steps to run in the pipeline.
past_seq_lens
Coordinated list of past sequence lengths (i.e., context lengths), one entry per request.
seq_lens
Coordinated list of current sequence lengths, one entry per request: the prompt length during prefill, or 1 during decode.
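A minimal construction sketch, assuming BatchInfo is a plain dataclass whose coordinated lists hold one integer per request (assumptions, not confirmed by this page):

```python
from max.pipelines.lib.pipeline_variants.text_generation import BatchInfo

# Hypothetical batch of two requests: the first is prefilling a 7-token
# prompt (nothing cached yet); the second is decoding its next token on
# top of 42 cached context tokens. The two lists are index-aligned.
info = BatchInfo(
    past_seq_lens=[0, 42],  # context lengths already processed
    seq_lens=[7, 1],        # prompt length during prefill, 1 during decode
    num_steps=4,            # run four pipeline steps for this batch
)
```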
TextGenerationPipeline
class max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)
Generalized token generator pipeline.
Parameters:
- pipeline_config (PipelineConfig)
- pipeline_model (type[PipelineModel[TextGenerationContextType]])
- eos_token_id (int)
- weight_adapters (dict[WeightsFormat, WeightsAdapter])
- tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest])
execute()
execute(inputs)
Processes the batch and returns decoded tokens.
Given a batch, this executes the graph for num_steps in a multi-step scenario, then decodes the tokens and returns the decoded tokens for each request; a usage sketch follows the parameter list.
Parameters:
- inputs (TextGenerationInputs[TextGenerationContextType])
Return type:
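A usage sketch. It assumes the pipeline and its inputs were already constructed by the serving layer, and that the return value (not listed on this page) maps request IDs to per-request generation outputs; both are assumptions:

```python
from max.pipelines.lib.pipeline_variants.text_generation import (
    TextGenerationPipeline,
)

def run_batch(pipeline: TextGenerationPipeline, inputs) -> None:
    # `inputs` is a populated TextGenerationInputs batch built elsewhere.
    # The mapping-style return value is assumed for illustration only.
    outputs = pipeline.execute(inputs)
    for request_id, output in outputs.items():
        print(request_id, output)
```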
initialize_bitmask()
initialize_bitmask(batch)
Allocates a per-request token bitmask for structured decoding.
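For intuition, structured-decoding bitmasks are commonly laid out as one row per request with 32 vocabulary bits packed into each int32 word; a hedged NumPy sketch of that layout (the method's actual shape and dtype are not documented here):

```python
import numpy as np

batch_size, vocab_size = 4, 32_000
words_per_row = (vocab_size + 31) // 32  # 32 vocabulary bits per int32 word
# Start with every bit set: all tokens allowed until a grammar matcher
# masks some off for a given request.
bitmask = np.full((batch_size, words_per_row), -1, dtype=np.int32)
```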
kv_managers
Return the list of KV cache managers backing this pipeline.
pipeline_config
property pipeline_config: PipelineConfig
Return the pipeline configuration.
prepare_batch()
prepare_batch(batches, num_steps)
Prepare model inputs and ancillary state for multi-step execution.
This flattens replica batches, optionally initializes constrained decoding bitmasks, ensures KV-cache reservations, clamps num_steps per context, and builds the initial model inputs.
Parameters:
- batches (list[list[TextGenerationContextType]])
- num_steps (int)
Returns:
- ModelInputs: Prepared inputs for the first step.
- int: The clamped number of steps to run.
- Optional[np.ndarray]: The structured decoding bitmask, or None.
- list[TextGenerationContextType]: The flattened context batch.
Return type:
tuple[ModelInputs, int, Optional[np.ndarray], list[TextGenerationContextType]]
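One plausible reading of the num_steps clamp described above, as a hedged sketch (the attribute names current_length and max_length are illustrative, not the real context API):

```python
def clamp_num_steps(contexts, num_steps: int) -> int:
    # Illustrative only: cap the batch's step count by the smallest
    # remaining token budget across its contexts, so no request overruns.
    for ctx in contexts:
        remaining = ctx.max_length - ctx.current_length
        num_steps = min(num_steps, remaining)
    return max(num_steps, 1)
```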
release()
release(request_id)
Mark the context as complete, releasing the cache slot from the KV manager.
Note: KV cache lifecycle is now managed by the scheduler. This method is kept for interface compatibility but is a no-op for regular pipelines.
Parameters:
- request_id (RequestID)
Return type:
None
tokenizer
property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]
Return the tokenizer used for building contexts and decoding.
update_for_structured_output()
update_for_structured_output(context, bitmask, index)
Update context and logits bitmask for structured output.
If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context. It may also jump ahead in generation, and it fills the per-request token bitmask used to constrain the next-token distribution; a sketch follows below.
Parameters:
- context (TextGenerationContextType)
- bitmask (np.ndarray)
- index (int)
Raises:
ValueError – If a JSON schema is provided but structured output is not enabled via sampling configuration.
Return type:
None
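A hedged sketch of that flow, written against the xgrammar library's public API; whether this pipeline uses xgrammar internally is an assumption, and the context attributes (json_schema, matcher) are taken from the description above rather than a documented interface:

```python
import xgrammar as xgr

def install_and_fill(context, compiler: xgr.GrammarCompiler,
                     bitmask, index: int) -> None:
    # Compile and install a grammar matcher the first time a schema
    # appears on this context.
    if context.json_schema is not None and context.matcher is None:
        compiled = compiler.compile_json_schema(context.json_schema)
        context.matcher = xgr.GrammarMatcher(compiled)
    if context.matcher is not None:
        # Fill this request's row of the token bitmask so sampling only
        # considers tokens the grammar currently allows.
        context.matcher.fill_next_token_bitmask(bitmask, index)
```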
TextGenerationPipelineInterface
class max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipelineInterface(*args, **kwargs)
Interface for text generation pipelines.
StandaloneSpeculativeDecodingPipeline
final class max.pipelines.lib.speculative_decoding.StandaloneSpeculativeDecodingPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, draft_pipeline_model=None, draft_weight_adapters=None)
Bases: SpeculativeDecodingPipelineBase
Standalone speculative decoding where draft model runs independently.
In this approach, the draft model generates tokens without any information from the target model, then the target model verifies these tokens.
Parameters:
- pipeline_config (PipelineConfig)
- pipeline_model (type[PipelineModel[TextContext]])
- eos_token_id (int)
- weight_adapters (dict[WeightsFormat, WeightsAdapter])
- tokenizer (PipelineTokenizer[TextContext, npt.NDArray[np.integer[Any]], TextGenerationRequest])
- draft_pipeline_model (type[PipelineModel[TextContext]] | None)
- draft_weight_adapters (dict[WeightsFormat, WeightsAdapter] | None)
execute()
execute(inputs)
Execute standalone speculative decoding.
In standalone mode:
- The draft model generates tokens independently.
- The target model verifies the draft tokens.
- Rejection sampling accepts or rejects each draft token (sketched below).
Parameters:
- inputs (TextGenerationInputs[TextContext])
Return type:
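For reference, the standard speculative-decoding acceptance rule (Leviathan et al., 2023) accepts a draft token with probability min(1, p_target / p_draft); a hedged sketch of that rule, not necessarily this pipeline's exact sampler:

```python
import numpy as np

def accept_draft_token(p_target: float, p_draft: float,
                       rng: np.random.Generator) -> bool:
    # Accept with probability min(1, p_target / p_draft). On rejection,
    # the caller resamples from the residual distribution
    # max(0, p_target - p_draft), renormalized over the vocabulary.
    return rng.random() < min(1.0, p_target / p_draft)
```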
generate_draft_tokens()
generate_draft_tokens(batch, num_steps, model_inputs)
Generates draft tokens for the batch using the draft model.
Parameters:
- batch (list[TextContext])
- num_steps (int)
- model_inputs (ModelInputs)
Return type:
prepare_batch()
prepare_batch(model, batch, replica_batches, num_steps, return_n_logits, is_draft=False, draft_inputs=None, merged_draft_tokens=None, merged_draft_offsets=None)
Prepares batch inputs and KV cache for draft or target model.
Parameters:
- model (PipelineModel[TextContext])
- batch (list[TextContext])
- replica_batches (list[list[TextContext]])
- num_steps (int)
- return_n_logits (int)
- is_draft (bool)
- draft_inputs (ModelInputs | None)
- merged_draft_tokens (Buffer | None)
- merged_draft_offsets (Buffer | None)
Return type:
verify_draft_tokens_with_target_model()
verify_draft_tokens_with_target_model(draft_inputs, context_batch, replica_batches, num_draft_tokens_generated, draft_tokens, draft_logits, merged_draft_tokens, merged_draft_offsets, all_draft_logits)
Verifies draft tokens against the target model and returns merged outputs.
Parameters:
- draft_inputs (ModelInputs)
- context_batch (list[TextContext])
- replica_batches (list[list[TextContext]])
- num_draft_tokens_generated (int)
- draft_tokens (Buffer)
- draft_logits (Buffer)
- merged_draft_tokens (Buffer)
- merged_draft_offsets (Buffer)
- all_draft_logits (Buffer)
Return type:
EmbeddingsPipeline
final class max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)
Bases: Pipeline[EmbeddingsGenerationInputs, EmbeddingsGenerationOutput]
Generalized embeddings generation pipeline.
Parameters:
- pipeline_config (PipelineConfig)
- pipeline_model (type[PipelineModel[EmbeddingsContext]])
- eos_token_id (int)
- weight_adapters (dict[WeightsFormat, WeightsAdapter])
- tokenizer (PipelineTokenizer[BaseContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest])
execute()
execute(inputs)
Processes the batch and returns embeddings.
Given a batch, executes the graph and returns the list of embedding outputs per request.
Parameters:
- inputs (EmbeddingsGenerationInputs)
Return type:
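A usage sketch under the same caveats as the text generation example: the pipeline and inputs are constructed elsewhere, and the unlisted return value is assumed to map request IDs to embedding outputs:

```python
from max.pipelines.lib.embeddings_pipeline import EmbeddingsPipeline

def embed_batch(pipeline: EmbeddingsPipeline, inputs) -> None:
    # `inputs` is a populated EmbeddingsGenerationInputs batch; the
    # mapping-style return value is assumed for illustration only.
    outputs = pipeline.execute(inputs)
    for request_id, embeddings in outputs.items():
        print(request_id, embeddings)
```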
release()
release(request_id)
Releases resources for the request (no-op for embeddings).
Parameters:
- request_id (RequestID)
Return type:
None