Python module
pipeline
MAX pipeline for model inference and generation (Text Generation variant).
BatchInfo
class max.pipelines.lib.pipeline_variants.text_generation.BatchInfo(past_seq_lens, seq_lens, num_steps)
Information about a batch of requests passed to the pipeline.
num_steps
num_steps: int
Number of steps to execute in the pipeline.
past_seq_lens
Coordinated list of past sequence lengths (i.e., context lengths).
seq_lens
Coordinated list of current sequence lengths (i.e., prompt_len during prefill, or 1 during decode).
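As an illustration, here is a hedged sketch of constructing a BatchInfo for two requests (assuming plain Python lists of ints; the exact container type is not specified above):

```python
from max.pipelines.lib.pipeline_variants.text_generation import BatchInfo

# Hedged sketch: request 0 is mid-decode (64 cached tokens, seq_len 1);
# request 1 is prefilling a 12-token prompt (no cached tokens yet).
info = BatchInfo(
    past_seq_lens=[64, 0],
    seq_lens=[1, 12],
    num_steps=4,
)
```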
TextGenerationPipeline
class max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)
Generalized token generator pipeline.
Initialize a text generation pipeline instance.
This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.
Parameters:
- pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
- pipeline_model (type[PipelineModel[TextGenerationContextType]]) – Concrete model implementation to use for execution.
- eos_token_id (int) – Default EOS token id used when the HF config does not supply one, or to seed the EOS set.
- weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
- tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.
Raises:
ValueError – If quantization_encoding is not configured in pipeline_config.model_config, or if structured output is requested without a valid tokenizer delegate.
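A minimal construction sketch follows. This is hedged: MyPipelineModel, pipeline_config, weight_adapters, and tokenizer are placeholders for whatever concrete implementations your deployment provides; only the argument shapes mirror the signature above.

```python
# Hypothetical setup; substitute your concrete model and tokenizer types.
pipeline = TextGenerationPipeline(
    pipeline_config=pipeline_config,  # a configured PipelineConfig
    pipeline_model=MyPipelineModel,   # hypothetical PipelineModel subclass
    eos_token_id=2,                   # fallback EOS token id
    weight_adapters=weight_adapters,  # WeightsFormat -> WeightsAdapter mapping
    tokenizer=tokenizer,              # a PipelineTokenizer implementation
)
```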
calculate_num_steps()
calculate_num_steps(num_steps, context)
Compute the number of generation steps allowed for a context.
The value is clamped by the remaining capacity with respect to
the model’s configured max_seq_len.
Parameters:
- num_steps (int) – Desired number of steps to attempt.
- context (TextGenerationContextType) – The context whose sequence length constraints apply.
Returns:
The number of steps to execute for this context (>= 1).
Raises:
ValueError – If the current request length is already >= max_seq_len.
Return type:
int
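A hedged sketch of the clamping rule described above (not MAX's exact implementation): the step count is limited by how much room is left before the context reaches max_seq_len.

```python
def clamp_num_steps(num_steps: int, current_len: int, max_seq_len: int) -> int:
    """Clamp the requested step count to the remaining sequence capacity."""
    if current_len >= max_seq_len:
        raise ValueError(
            f"request length {current_len} already at or beyond max_seq_len {max_seq_len}"
        )
    return min(num_steps, max_seq_len - current_len)
```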
execute()
execute(inputs)
Provided a batch, process the batch inputs, execute the graph for num_steps in a multi-step scenario, then decode the resulting tokens and return the decoded tokens per request.
Parameters:
inputs (TextGenerationInputs[TextGenerationContextType])
Return type:
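A hedged usage sketch follows. The TextGenerationInputs field names and the `contexts` and `pipeline` objects are assumptions for illustration, not verified MAX API; check the signatures in your installed version.

```python
# Hypothetical: `contexts` is a prepared batch of generation contexts.
inputs = TextGenerationInputs(batches=[contexts], num_steps=8)
outputs = pipeline.execute(inputs)  # maps request id -> generation output
for request_id, output in outputs.items():
    print(request_id, output)
```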
initialize_bitmask()
initialize_bitmask(batch)
Allocate a per-request token bitmask for structured decoding.
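A hedged sketch of a packed per-request token bitmask, as commonly used by constrained-decoding grammar engines: one bit per vocabulary token, packed into 32-bit words. The exact dtype and layout in MAX are assumptions here.

```python
import numpy as np

# One row per request; one bit per vocab token, packed into int32 words.
batch_size, vocab_size = 4, 128_256
bitmask = np.zeros((batch_size, (vocab_size + 31) // 32), dtype=np.int32)
```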
kv_managers
Return the list of KV cache managers backing this pipeline.
pipeline_config
property pipeline_config: PipelineConfig
Return the pipeline configuration.
prepare_batch()
prepare_batch(batches, num_steps)
Prepare model inputs and ancillary state for multi-step execution.
This flattens replica batches, optionally initializes constrained
decoding bitmasks, ensures KV-cache reservations, clamps num_steps
per context, and builds initial model inputs.
Parameters:
- batches – The per-replica batches of contexts to prepare.
- num_steps (int) – Requested number of steps, before per-context clamping.
Returns:
- ModelInputs: Prepared inputs for the first step.
- int: The clamped number of steps to run.
- Optional[np.ndarray]: The structured decoding bitmask, or None.
- list[TextGenerationContextType]: The flattened context batch.
Return type:
A tuple of (ModelInputs, int, Optional[np.ndarray], list[TextGenerationContextType]).
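Consuming the documented 4-tuple might look like this (hedged sketch; variable names are illustrative):

```python
# Unpack prepare_batch's documented return tuple.
model_inputs, clamped_steps, bitmask, flat_batch = pipeline.prepare_batch(
    batches, num_steps=8
)
```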
release()
release(request_id)
Mark the context as complete, releasing the cache slot from the KV manager.
Parameters:
request_id (RequestID)
Return type:
None
tokenizer
property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[int, ...], dtype[integer[Any]]], TextGenerationRequest]
Return the tokenizer used for building contexts and decoding.
update_context_and_prepare_responses()
update_context_and_prepare_responses(generated_tokens_host, batch_log_probabilities, flat_batch, num_steps, enable_log_probs)
Update the context objects and prepare the response objects for each context in the batch after generation.
Parameters:
- generated_tokens_host (ndarray[tuple[int, ...], dtype[int32]]) – Array of generated tokens on the host, indexed as [batch, step].
- batch_log_probabilities (list[list[LogProbabilities | None]]) – Per-step log probability outputs (or None); each entry is a list over the batch for that step.
- flat_batch (list[TextGenerationContextType]) – Generation contexts, one per request, matching the batch dimension.
- num_steps (int) – Number of generation steps to process for each context.
- enable_log_probs (bool) – Whether to include log probability data in outputs.
Returns:
A dictionary mapping request IDs to their respective generation outputs.
Return type:
dict[RequestID, TextGenerationOutput]
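A hedged sketch of the documented [batch, step] indexing (the per-context update logic itself is elided):

```python
# Walk generated tokens host-side; axis 0 is the batch, axis 1 the step.
for b, context in enumerate(flat_batch):
    for step in range(num_steps):
        token = int(generated_tokens_host[b, step])
        # ... update `context` with `token`, stopping at EOS or max length ...
```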
update_for_structured_output()
update_for_structured_output(context, bitmask, index)
Update context and logits bitmask for structured output.
If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context. It may also jump ahead in generation and fill the per-request token bitmask used to constrain the next-token distribution.
Parameters:
- context (TextGenerationContextType)
- bitmask
- index
Raises:
ValueError – If a JSON schema is provided but structured output is not enabled via sampling configuration.
Return type:
None
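A hedged illustration of the general technique (not MAX's kernel): tokens disallowed by the bitmask receive -inf logits before sampling. This assumes a 1-D int32 mask packed little-endian, one bit per vocabulary token.

```python
import numpy as np

def apply_token_bitmask(logits: np.ndarray, packed_mask: np.ndarray) -> np.ndarray:
    """Mask disallowed tokens for one request; packed_mask is 1-D int32."""
    vocab = logits.shape[-1]
    allowed = np.unpackbits(
        packed_mask.view(np.uint8), bitorder="little"
    )[:vocab].astype(bool)
    constrained = logits.copy()
    constrained[~allowed] = -np.inf  # excluded from the next-token distribution
    return constrained
```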
SpeculativeDecodingTextGenerationPipeline
final class max.pipelines.lib.speculative_decoding.SpeculativeDecodingTextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)
Bases: Pipeline[TextGenerationInputs[TextContext], TextGenerationOutput], GenerateMixin[TextContext, TextGenerationRequest]
Generalized token generator pipeline with speculative decoding.
Parameters:
- pipeline_config (PipelineConfig)
- pipeline_model (type[PipelineModel[TextContext]])
- eos_token_id (int)
- weight_adapters (dict[WeightsFormat, WeightsAdapter])
- tokenizer (PipelineTokenizer[TextContext, npt.NDArray[np.integer[Any]], TextGenerationRequest])
build_response()
build_response(context_batch)
Build response from updated contexts.
Parameters:
context_batch (list[TextContext]) – The list of updated context objects.
Returns:
Dictionary mapping request IDs to TextGenerationOutput objects.
Return type:
dict[RequestID, TextGenerationOutput]
calculate_num_steps()
calculate_num_steps(model, huggingface_config, num_steps, context, is_draft=False)
Parameters:
- model (PipelineModel[TextContext])
- huggingface_config (AutoConfig)
- num_steps (int)
- context (TextContext)
- is_draft (bool)
Return type:
int
execute()
execute(inputs)
Provided a batch, execute the draft model for num_steps and the target model for num_steps + 1 tokens, accept final tokens via rejection sampling, and return the variable-length list of accepted token ids per request.
Parameters:
inputs (TextGenerationInputs[TextContext])
Return type:
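A hedged sketch of the standard speculative-decoding acceptance rule (Leviathan et al., 2023), not MAX's exact kernel: accept draft token t while target_p(t) / draft_p(t) beats a uniform draw; on the first rejection, resample from the normalized residual max(target - draft, 0).

```python
import numpy as np

def accept_draft_tokens(
    draft_tokens: np.ndarray,   # [num_steps] draft token ids
    draft_probs: np.ndarray,    # [num_steps, vocab] draft distributions
    target_probs: np.ndarray,   # [num_steps, vocab] target distributions
    rng: np.random.Generator,
) -> tuple[list[int], int]:
    """Return (accepted tokens, index of first rejected draft token)."""
    accepted: list[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.uniform() < min(1.0, p / q):
            accepted.append(int(tok))
            continue
        # First rejection: resample from the residual distribution.
        residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
        residual /= residual.sum()
        accepted.append(int(rng.choice(len(residual), p=residual)))
        return accepted, i
    # All drafts accepted; a bonus token from the target's extra step
    # would typically be appended by the caller.
    return accepted, len(draft_tokens)
```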
generate_draft_tokens()
generate_draft_tokens(batch, num_steps, model_inputs)
-
Parameters:
-
- batch (list[TextContext])
- num_steps (int)
- model_inputs (ModelInputs)
-
Return type:
kv_managers
property kv_managers: list[PagedKVCacheManager]
metrics
property metrics: SpeculativeDecodingMetrics
Get the current speculative decoding metrics.
Returns:
The SpeculativeDecodingMetrics instance with current statistics.
pipeline_config
property pipeline_config: PipelineConfig
prepare_batch()
prepare_batch(model, batch, num_steps, return_n_logits, is_draft=False, draft_inputs=None, merged_draft_tokens=None, merged_draft_offsets=None)
Parameters:
- model (PipelineModel[TextContext])
- batch (list[TextContext])
- num_steps (int)
- return_n_logits (int)
- is_draft (bool)
- draft_inputs (ModelInputs | None)
- merged_draft_tokens (Tensor | None)
- merged_draft_offsets (Tensor | None)
Return type:
release()
release(request_id)
Releases resources associated with this request ID.
Parameters:
request_id (RequestID) – Unique identifier for the finished request.
Return type:
None
sample_draft_logits()
sample_draft_logits(batch, model_outputs, prev_tokens, prev_logits, top_k, max_k, temperature, top_p, min_top_p, seed)
tokenizer
property tokenizer: PipelineTokenizer[TextContext, ndarray[tuple[int, ...], dtype[integer[Any]]], TextGenerationRequest]
update_contexts()
update_contexts(context_batch, first_rejected_tokens, recovered_tokens, bonus_tokens, draft_tokens, num_draft_tokens_generated)
Update contexts with the results of token generation.
Parameters:
- context_batch (list[TextContext]) – The list of context objects.
- first_rejected_tokens (ndarray[tuple[int, ...], dtype[integer[Any]]]) – Array indicating, per request, the index of the first rejected draft token.
- recovered_tokens (ndarray[tuple[int, ...], dtype[integer[Any]]]) – Tokens resampled from the target model at the first rejected position.
- bonus_tokens (ndarray[tuple[int, ...], dtype[integer[Any]]]) – Extra tokens sampled from the target model when all draft tokens are accepted.
- draft_tokens (ndarray[tuple[int, ...], dtype[integer[Any]]]) – Array of draft tokens.
- num_draft_tokens_generated (int) – Number of tokens generated by the draft model.
Return type:
None
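A hedged sketch of how the arrays above combine for one request (illustrative, not MAX's exact logic; array shapes are assumptions): take draft tokens up to the first rejection, then append either the recovered token (on rejection) or the bonus token (when every draft token was accepted).

```python
import numpy as np

def final_tokens(i: int, first_rejected: np.ndarray, draft: np.ndarray,
                 recovered: np.ndarray, bonus: np.ndarray,
                 num_draft: int) -> list[int]:
    """Assemble the accepted token sequence for request i."""
    cut = int(first_rejected[i])
    tokens = [int(t) for t in draft[i, :cut]]
    if cut < num_draft:
        tokens.append(int(recovered[i]))  # resampled at the rejection point
    else:
        tokens.append(int(bonus[i]))      # all drafts accepted
    return tokens
```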
verify_draft_tokens_with_target_model()
verify_draft_tokens_with_target_model(draft_inputs, context_batch, num_draft_tokens_generated, draft_tokens, draft_logits, merged_draft_tokens, merged_draft_offsets, all_draft_logits)
EmbeddingsPipeline
final class max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)
Bases: Pipeline[EmbeddingsGenerationInputs, EmbeddingsGenerationOutput]
Generalized embeddings generation pipeline.
Parameters:
- pipeline_config (PipelineConfig)
- pipeline_model (type[PipelineModel[EmbeddingsContext]])
- eos_token_id (int)
- weight_adapters (dict[WeightsFormat, WeightsAdapter])
- tokenizer (PipelineTokenizer[BaseContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest])
execute()
execute(inputs)
Provided a batch, process the batch inputs, execute the graph, and return the computed embeddings for each request.
Parameters:
inputs (EmbeddingsGenerationInputs)
Return type:
release()
release(request_id)
Release any resources or state associated with a specific request.
This method should be implemented by concrete pipeline classes to perform cleanup or resource deallocation for the given request ID. It is typically called when a request has completed processing and its associated resources (such as memory, cache, or temporary files) are no longer needed.
Parameters:
request_id (RequestID) – The unique identifier of the request to release resources for.
Returns:
None
Raises:
NotImplementedError – If not implemented by a concrete subclass.
Return type:
None