Python class

TextGenerationPipeline

class max.pipelines.TextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)

Bases: TextGenerationPipelineInterface[TextGenerationContextType], Generic[TextGenerationContextType]

Generalized token generator pipeline.

Initialize a text generation pipeline instance.

This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.

Parameters:

  • pipeline_config
  • pipeline_model
  • eos_token_id
  • weight_adapters
  • tokenizer

Raises:

ValueError – If quantization_encoding is not configured in pipeline_config.model or if structured output is requested without a valid tokenizer delegate.

execute()

execute(inputs)

Processes the batch and returns decoded tokens.

Executes the graph for a single decode step, samples the next token, then decodes and returns the generated tokens.

Parameters:

inputs (TextGenerationInputs[TextGenerationContextType])

Return type:

dict[RequestID, TextGenerationOutput]
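The decode step described for execute() can be illustrated with a toy, library-free sketch. It is not the pipeline's implementation: precomputed logits stand in for graph execution, greedy argmax stands in for the sampling kernel, and the names `decode_step` and `vocab` are hypothetical.

```python
def decode_step(batch_logits, vocab):
    """Run one toy decode step for every request in the batch.

    batch_logits maps a request ID to that request's next-token logits.
    Returns a dict keyed by request ID, mirroring the shape of the
    documented return value (request ID -> output).
    """
    outputs = {}
    for request_id, logits in batch_logits.items():
        # Greedy choice: pick the index of the largest logit.
        # (Assumption: a real sampler may apply temperature, top-k, etc.)
        token_id = max(range(len(logits)), key=logits.__getitem__)
        # "Decode" the token ID back to text with a toy vocabulary.
        outputs[request_id] = {"token_id": token_id, "text": vocab[token_id]}
    return outputs


vocab = ["<eos>", "hello", "world"]
step_outputs = decode_step(
    {"req-a": [0.1, 2.0, 0.3], "req-b": [0.0, 0.5, 1.5]},
    vocab,
)
print(step_outputs)
```

Each call produces exactly one new token per request, which is why the result is keyed by request ID rather than returned as a flat list.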

initialize_bitmask()

initialize_bitmask(batch)

Allocates a per-request token bitmask for structured decoding.

Parameters:

batch (list[TextGenerationContextType]) – The generation contexts for the batch.

Returns:

A bitmask array of shape [batch_size, vocab_size] if structured output is enabled; otherwise None.

Return type:

ndarray[tuple[Any, ...], dtype[int32]] | None
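As a rough illustration of the return contract above, the following sketch allocates a [batch_size, vocab_size] mask only when structured output is enabled. It makes several assumptions: nested Python lists stand in for the int32 ndarray, a value of 1 is taken to mean "token allowed", and the function name `allocate_bitmask` and its parameters are hypothetical.

```python
def allocate_bitmask(batch_size, vocab_size, structured_output_enabled):
    """Return a [batch_size, vocab_size] mask, or None when not needed."""
    if not structured_output_enabled:
        return None
    # Assumption: start with every token allowed (1) for every request.
    # Each row is a separate list so rows can be updated independently.
    return [[1] * vocab_size for _ in range(batch_size)]


mask = allocate_bitmask(batch_size=2, vocab_size=5, structured_output_enabled=True)
no_mask = allocate_bitmask(batch_size=2, vocab_size=5, structured_output_enabled=False)
print(len(mask), len(mask[0]), no_mask)
```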

kv_manager

property kv_manager: PagedKVCacheManager

Returns the KV cache manager for this pipeline.

pipeline_config

property pipeline_config: PipelineConfig

Return the pipeline configuration.

prepare_batch()

prepare_batch(batches, num_steps)

Prepare model inputs and ancillary state for execution.

This flattens replica batches, optionally initializes constrained decoding bitmasks, ensures KV-cache reservations, and builds initial model inputs.

Parameters:

  • batches (list[list[TextGenerationContextType]]) – Per-replica list of contexts.
  • num_steps (int) – Number of decode steps reserved in the KV cache.

Returns:

A tuple of:

  • ModelInputs: Prepared inputs for the step.
  • Optional[np.ndarray]: The structured decoding bitmask or None.
  • list[TextGenerationContextType]: The flattened context batch.

Return type:

tuple
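The flattening and reservation steps described for prepare_batch() can be sketched without the library. This is a simplified illustration, not the pipeline's code: plain dicts stand in for context objects, and the reservation rule in `slots_to_reserve` (tokens so far plus num_steps) is an assumption about what "decode steps reserved" could mean.

```python
from itertools import chain


def flatten_replica_batches(batches):
    """Concatenate per-replica context lists into one flat batch, in order."""
    return list(chain.from_iterable(batches))


def slots_to_reserve(context, num_steps):
    """Hypothetical rule: tokens already present plus upcoming decode steps."""
    return len(context["tokens"]) + num_steps


# Two replicas holding toy contexts.
replica_batches = [
    [{"id": "req-a", "tokens": [1, 2, 3]}, {"id": "req-b", "tokens": [4]}],
    [{"id": "req-c", "tokens": [5, 6]}],
]
flat_batch = flatten_replica_batches(replica_batches)
reservations = {ctx["id"]: slots_to_reserve(ctx, num_steps=4) for ctx in flat_batch}
print([ctx["id"] for ctx in flat_batch])
print(reservations)
```

Flattening preserves replica order, so a request's position in the flat batch can double as its row index into a per-request bitmask.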

release()

release(request_id)

Release model-specific resources for a completed request.

The lifecycle of the primary and extra KV caches is managed by the batch constructor. This method handles only model-specific cleanup (e.g. a vision encoder cache).

Parameters:

request_id (RequestID)

Return type:

None
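To make "model-specific cleanup" concrete, here is a minimal sketch of a per-request cache whose entries are dropped on release. The class `ToyEncoderCache` and its methods are hypothetical and are not part of the library; the sketch only shows the pattern of keying auxiliary state by request ID and making release safe to call more than once.

```python
class ToyEncoderCache:
    """Hypothetical per-request cache, e.g. for vision encoder outputs."""

    def __init__(self):
        self._entries = {}

    def put(self, request_id, value):
        self._entries[request_id] = value

    def release(self, request_id):
        # pop() with a default makes releasing an unknown or
        # already-released request a harmless no-op.
        self._entries.pop(request_id, None)

    def __contains__(self, request_id):
        return request_id in self._entries


cache = ToyEncoderCache()
cache.put("req-a", [0.1, 0.2])
cache.put("req-b", [0.3])
cache.release("req-a")
cache.release("req-a")  # second call does nothing
print("req-a" in cache, "req-b" in cache)
```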

tokenizer

property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]

Return the tokenizer used for building contexts and decoding.

update_for_structured_output()

update_for_structured_output(context, bitmask, index)

Update context and logits bitmask for structured output.

If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context. It may also jump ahead in generation, and it fills the per-request token bitmask used to constrain the next-token distribution.

Parameters:

  • context (TextGenerationContextType) – Request context to update.
  • bitmask (ndarray[tuple[Any, ...], dtype[int32]]) – Optional preallocated bitmask buffer; updated in place.
  • index (int) – Global position into the bitmask for this request.

Raises:

InputError – If a JSON schema is provided but structured output is not enabled via sampling configuration.

Return type:

None
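How a per-request bitmask row can constrain the next-token distribution is sketched below. This is a conceptual illustration under stated assumptions, not the library's method: nested lists stand in for the int32 buffer, 1 is taken to mean "token allowed", the set of allowed token IDs is supplied by hand instead of by a grammar matcher, and `fill_mask_row` and `apply_mask` are hypothetical helpers.

```python
import math


def fill_mask_row(bitmask, index, allowed_token_ids):
    """Overwrite row `index` of the mask in place: 1 = allowed, 0 = blocked."""
    row = bitmask[index]
    for token_id in range(len(row)):
        row[token_id] = 1 if token_id in allowed_token_ids else 0


def apply_mask(logits, mask_row):
    """Set blocked tokens to -inf so they can never be selected."""
    return [x if allowed else -math.inf for x, allowed in zip(logits, mask_row)]


# Batch of two requests over a four-token vocabulary, all tokens allowed.
bitmask = [[1, 1, 1, 1], [1, 1, 1, 1]]

# Pretend a grammar permits only tokens 0 and 3 for the request at index 1.
fill_mask_row(bitmask, index=1, allowed_token_ids={0, 3})

logits = [0.2, 5.0, 1.0, 0.7]
masked = apply_mask(logits, bitmask[1])
next_token = max(range(len(masked)), key=masked.__getitem__)
print(bitmask, next_token)
```

Token 1 has the highest raw logit, but the mask blocks it, so the greedy choice falls to the best remaining allowed token. The row for the other request is untouched because only the row at `index` is written.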