Python class

OverlapTextGenerationPipeline

final class max.pipelines.lib.OverlapTextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, disable_overlap=False)

Bases: TextGenerationPipelineInterface[TextGenerationContextType], Generic[TextGenerationContextType]

A text generation pipeline that overlaps CPU-side scheduling work with GPU execution.

Initialize a text generation pipeline instance.

This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.

Parameters:

  • pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
  • pipeline_model (type[PipelineModel[Any]]) – Concrete model implementation to use for execution.
  • eos_token_id (int) – Default EOS token ID, used when the Hugging Face config does not supply one or to seed the set of EOS tokens.
  • weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
  • tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.
  • disable_overlap (bool) – When set, the overlap scheduler synchronizes immediately after model execution, removing any potential CPU/GPU overlap.

Raises:

ValueError – If quantization_encoding is not configured in pipeline_config.model or if structured output is requested without a valid tokenizer delegate.
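
For example, construction might look like the sketch below. The config object, model class, and tokenizer here are placeholders, not values documented on this page:

from max.pipelines.lib import OverlapTextGenerationPipeline

# Hypothetical placeholders: substitute a real PipelineConfig, a concrete
# PipelineModel subclass, and a PipelineTokenizer implementation.
pipeline = OverlapTextGenerationPipeline(
    pipeline_config=my_pipeline_config,  # configured PipelineConfig
    pipeline_model=MyPipelineModel,      # concrete PipelineModel subclass
    eos_token_id=2,                      # fallback EOS id if the HF config lacks one
    weight_adapters={},                  # WeightsFormat -> WeightsAdapter mapping
    tokenizer=my_tokenizer,              # PipelineTokenizer implementation
    disable_overlap=False,               # keep CPU/GPU overlap enabled
)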

execute()

execute(inputs)

Executes a batch of requests asynchronously on the GPU.

This method returns before the outputs for the current batch are ready. The caller may need to call execute() again (possibly with an empty batch) to retrieve these outputs. For example:

output_a = pipeline.execute(inputs)
# Outputs for `inputs` are not ready yet, so nothing is returned for them.
assert len(output_a) == 0

output_b = pipeline.execute(empty_inputs)
# The second call returns the outputs for the first batch.
assert len(output_b) == len(inputs.flat_batch)

Parameters:

inputs (TextGenerationInputs[TextGenerationContextType]) – The inputs for the batch.

Returns:

A dictionary mapping request IDs to outputs. The outputs do not correspond to the requests in the current input batch; they are the results of the previous batch.

Return type:

dict[RequestID, TextGenerationOutput]
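
Because the returned outputs belong to the previous batch, callers typically key their bookkeeping on request ID rather than batch position. A minimal sketch (handle is a hypothetical callback, not part of this API):

for request_id, output in pipeline.execute(inputs).items():
    # These outputs belong to the *previous* batch, not to `inputs`.
    handle(request_id, output)  # hypothetical per-request callback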

has_pending_outputs()

has_pending_outputs()

Returns True if there are pending outputs for the previous batch.

If this returns True, the caller should call execute() again, even with empty inputs, to retrieve the outputs for the previous batch.

Return type:

bool
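
For example, a drain loop at the end of a batch stream might look like this (empty_inputs stands for an empty TextGenerationInputs batch, constructed however your application builds one; finish is a hypothetical handler):

while pipeline.has_pending_outputs():
    # An empty batch advances the pipeline so the in-flight
    # batch's outputs can be collected.
    for request_id, output in pipeline.execute(empty_inputs).items():
        finish(request_id, output)  # hypothetical completion handler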

kv_manager

property kv_manager: PagedKVCacheManager

Returns the KV cache manager for this pipeline.

pipeline_config

property pipeline_config: PipelineConfig

Return the pipeline configuration.

release()

release(request_id)

Marks the context as complete, releasing its cache slot from the KV manager.

Note: The primary KV cache lifecycle is managed by the scheduler. This method handles extra KV caches managed by the pipeline model (e.g., the indexer cache for DeepSeekV3.2).

Parameters:

request_id (RequestID)

Return type:

None
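
A sketch of the intended call pattern, releasing a request's slot once it has finished (whether an output reports completion via an is_done attribute is an assumption here, not documented on this page):

for request_id, output in pipeline.execute(inputs).items():
    if output.is_done:  # assumed completion flag on the output object
        pipeline.release(request_id)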

spec_decode_metrics()

spec_decode_metrics()

Returns the draft token acceptance metrics for speculative decoding, or None if no metrics are available.

Return type:

SpeculativeDecodingMetrics | None
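
Since the return value may be None, guard before use:

metrics = pipeline.spec_decode_metrics()
if metrics is not None:
    # Only populated when speculative decoding is in use.
    print(metrics)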

tokenizer

property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]

Return the tokenizer used for building contexts and decoding.

warmup_graph_capture()

warmup_graph_capture()

Initializes and runs device graph capture warmup for the overlap execution path.

Return type:

None
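
This is typically invoked once at startup, before serving traffic, so that graph capture does not add latency to the first real batch (the ordering here is a sketch, not a documented requirement):

pipeline.warmup_graph_capture()  # one-time warmup at startup
# ... then serve batches with pipeline.execute(...)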