
TextGenerationPipeline

class max.pipelines.TextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)

Bases: TextGenerationPipelineInterface[TextGenerationContextType], Generic[TextGenerationContextType]

Generalized token generator pipeline.

Initialize a text generation pipeline instance.

This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.

Parameters:

  • pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
  • pipeline_model (type[PipelineModel[TextGenerationContextType]]) – Concrete model implementation to use for execution.
  • eos_token_id (int) – Default EOS token ID, used when the Hugging Face config does not supply one and to seed the set of EOS tokens.
  • weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
  • tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.

Raises:

ValueError – If quantization_encoding is not configured in pipeline_config.model or if structured output is requested without a valid tokenizer delegate.

execute()

execute(inputs)

Processes the batch and returns decoded tokens.

Given a batch, executes the graph for num_steps in a multi-step scenario, decodes the generated tokens, and returns the decoded outputs keyed by request ID.

Parameters:

inputs (TextGenerationInputs[TextGenerationContextType])

Return type:

dict[RequestID, TextGenerationOutput]
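The multi-step behavior described above can be illustrated with a toy sketch. This is not the real implementation: `TextGenerationOutput` is a stand-in dataclass, request IDs are plain strings, and the model graph is replaced by a fake next-token rule.

```python
# Hypothetical sketch of execute()'s multi-step loop: run num_steps of
# generation per request, then return outputs keyed by request ID.
from dataclasses import dataclass, field


@dataclass
class TextGenerationOutput:  # stand-in for the real MAX output type
    request_id: str
    tokens: list[int] = field(default_factory=list)


def execute_sketch(batch: dict[str, list[int]], num_steps: int) -> dict[str, TextGenerationOutput]:
    outputs = {rid: TextGenerationOutput(request_id=rid) for rid in batch}
    for _ in range(num_steps):
        for rid, prompt in batch.items():
            # A real pipeline would execute the model graph here; we fake a token.
            next_token = (sum(prompt) + len(outputs[rid].tokens)) % 100
            outputs[rid].tokens.append(next_token)
    return outputs


results = execute_sketch({"req-0": [1, 2, 3]}, num_steps=4)
```

The key point mirrored from the docs: the graph (here, the fake rule) runs once per step, and the caller gets back one output object per request ID.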

initialize_bitmask()

initialize_bitmask(batch)

Allocates a per-request token bitmask for structured decoding.

Parameters:

batch (list[TextGenerationContextType]) – The generation contexts for the batch.

Returns:

A bitmask array of shape [batch_size, vocab_size] if structured output is enabled; otherwise None.

Return type:

ndarray[tuple[Any, ...], dtype[int32]] | None
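A minimal sketch of the documented allocation behavior, using NumPy. The all-ones initial value (every token allowed) is an assumption for illustration, not confirmed by the source.

```python
import numpy as np


def initialize_bitmask_sketch(batch_size: int, vocab_size: int, structured_output: bool):
    """Allocate a [batch_size, vocab_size] int32 bitmask, or None when
    structured output is disabled (mirroring the documented return contract)."""
    if not structured_output:
        return None
    # Assumption: start with all tokens allowed (1 = allowed).
    return np.ones((batch_size, vocab_size), dtype=np.int32)


mask = initialize_bitmask_sketch(2, 8, structured_output=True)
```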

kv_manager

property kv_manager: PagedKVCacheManager

Returns the KV cache manager for this pipeline.

pipeline_config

property pipeline_config: PipelineConfig

Return the pipeline configuration.

prepare_batch()

prepare_batch(batches, num_steps)

Prepare model inputs and ancillary state for multi-step execution.

This flattens replica batches, optionally initializes constrained decoding bitmasks, ensures KV-cache reservations, clamps num_steps per context, and builds initial model inputs.

Parameters:

  • batches (list[list[TextGenerationContextType]]) – Per-replica list of contexts.
  • num_steps (int) – Desired number of steps to run.

Returns:

  • ModelInputs: Prepared inputs for the first step.
  • int: The clamped number of steps to run.
  • Optional[np.ndarray]: The structured decoding bitmask or None.
  • list[TextGenerationContextType]: The flattened context batch.

Return type:

tuple[ModelInputs, int, np.ndarray | None, list[TextGenerationContextType]]
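Two of the documented steps, flattening replica batches and clamping num_steps, can be sketched as follows. `max_remaining_steps` is a hypothetical per-context attribute standing in for whatever state the real contexts track; the point is only the flatten-then-clamp shape.

```python
# Illustrative sketch of prepare_batch's flatten and clamp steps.
from dataclasses import dataclass


@dataclass
class Context:  # stand-in for a real generation context
    request_id: str
    max_remaining_steps: int  # hypothetical: steps this request can still run


def flatten_and_clamp(batches: list[list[Context]], num_steps: int):
    # Flatten the per-replica lists into one batch.
    flat = [ctx for replica in batches for ctx in replica]
    # Clamp num_steps so no context runs past its limit.
    clamped = min([num_steps] + [ctx.max_remaining_steps for ctx in flat])
    return flat, clamped


flat, steps = flatten_and_clamp(
    [[Context("a", 10)], [Context("b", 3), Context("c", 7)]], num_steps=5
)
```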

release()

release(request_id)

Release model-specific resources for a completed request.

The primary and extra KV cache lifecycles are managed by the batch constructor. This method handles only model-specific cleanup (e.g. the vision encoder cache).

Parameters:

request_id (RequestID)

Return type:

None

tokenizer

property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]

Return the tokenizer used for building contexts and decoding.

update_for_structured_output()

update_for_structured_output(context, bitmask, index)

Update context and logits bitmask for structured output.

If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context. It may also jump ahead in generation, and it fills the per-request token bitmask used to constrain the next-token distribution.

Parameters:

  • context (TextGenerationContextType) – Request context to update.
  • bitmask (ndarray[tuple[Any, ...], dtype[int32]]) – Optional preallocated bitmask buffer; updated in-place.
  • index (int) – Global position into the bitmask for this request.

Raises:

ValueError – If a JSON schema is provided but structured output is not enabled via sampling configuration.

Return type:

None
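How a filled bitmask row constrains the next-token distribution can be sketched with NumPy. The 1 = allowed / 0 = disallowed convention and the masking-to-negative-infinity step are illustrative assumptions about how such a bitmask is typically consumed, not the pipeline's confirmed internals.

```python
import numpy as np


def apply_bitmask_row(logits: np.ndarray, bitmask: np.ndarray, index: int) -> np.ndarray:
    """Mask `logits` (shape [vocab]) in place using row `index` of `bitmask`."""
    row = bitmask[index]       # assumption: 1 = token allowed, 0 = disallowed
    logits[row == 0] = -np.inf # disallowed tokens can never be sampled
    return logits


bitmask = np.array([[1, 0, 1, 0]], dtype=np.int32)
logits = apply_bitmask_row(np.array([0.5, 2.0, 1.0, 3.0]), bitmask, index=0)
```

After masking, only tokens 0 and 2 remain sampleable, so greedy decoding picks token 2 even though token 3 had the highest raw logit.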