Python class

OverlapTextGenerationPipeline

final class max.pipelines.lib.OverlapTextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, disable_overlap=False)

Bases: TextGenerationPipelineInterface[TextGenerationContextType], Generic[TextGenerationContextType]

Text generation pipeline that overlaps CPU work with GPU execution by deferring synchronization of each batch until the next execute() call.

Initialize a text generation pipeline instance.

This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.

Parameters:

  • pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
  • pipeline_model (type[PipelineModel[Any]]) – Concrete model implementation to use for execution.
  • eos_token_id (int) – Default EOS token ID, used when the Hugging Face config does not supply one, and to seed the set of EOS tokens.
  • weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
  • tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.
  • disable_overlap (bool) – When set, the overlap scheduler synchronizes immediately after model execution, removing any potential CPU/GPU overlap.

Raises:

ValueError – If quantization_encoding is not configured in pipeline_config.model or if structured output is requested without a valid tokenizer delegate.
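
A minimal construction sketch follows. The config, model class, adapter mapping, and tokenizer are placeholders standing in for components you would build elsewhere; none of the placeholder names come from this library.

from max.pipelines.lib import OverlapTextGenerationPipeline

pipeline = OverlapTextGenerationPipeline(
    pipeline_config=config,          # assumed: a configured PipelineConfig
    pipeline_model=MyPipelineModel,  # assumed: a concrete PipelineModel subclass
    eos_token_id=2,                  # fallback EOS id if the HF config lacks one
    weight_adapters=adapters,        # assumed: dict[WeightsFormat, WeightsAdapter]
    tokenizer=tokenizer,             # assumed: a PipelineTokenizer implementation
    disable_overlap=False,           # keep CPU/GPU overlap enabled
)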

draft_kv_blocks

property draft_kv_blocks: list[Buffer] | None

Returns the draft KV cache block buffers, one per DP replica.

Returns None when speculative decoding is not active.
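
For illustration, a quick check of this property, assuming a constructed pipeline:

blocks = pipeline.draft_kv_blocks
if blocks is None:
    print("speculative decoding is not active")
else:
    print(f"{len(blocks)} draft KV block buffers, one per DP replica")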

execute()

execute(inputs)

Executes a batch of requests asynchronously on the GPU.

This method returns before the outputs for the current batch are ready. The caller may need to call execute() again (possibly with an empty batch) to retrieve these outputs. For example:

output_a = pipeline.execute(inputs)
assert len(output_a) == 0

output_b = pipeline.execute(empty_inputs)
assert len(output_b) == len(inputs.flat_batch)

Parameters:

inputs (TextGenerationInputs[TextGenerationContextType]) – The inputs for the batch.

Returns:

A dictionary mapping request IDs to outputs. The outputs do not correspond to the requests in the input batch; they belong to the previous batch.

Return type:

dict[RequestID, TextGenerationOutput]

has_pending_outputs()

has_pending_outputs()

Returns True if there are pending outputs for the previous batch.

If this is True, the caller should call execute() even with empty inputs to retrieve the outputs for the previous batch.

Return type:

bool
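
A sketch of the drain pattern this implies, assuming a constructed pipeline and an empty_inputs value (a TextGenerationInputs with no requests) built by the caller:

results = pipeline.execute(inputs)  # may return outputs from an earlier batch
while pipeline.has_pending_outputs():
    results.update(pipeline.execute(empty_inputs))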

initialize_bitmask()

initialize_bitmask(batch)

Allocates a per-request token bitmask for structured decoding.

Parameters:

batch (list[TextGenerationContextType]) – The generation contexts for the batch.

Returns:

A bitmask array of shape [batch_size, vocab_size] if structured output is enabled; otherwise None.

Return type:

ndarray[tuple[Any, …], dtype[int32]] | None
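
A sketch of the documented shape contract; vocab_size is an assumed stand-in for the model's vocabulary size:

bitmask = pipeline.initialize_bitmask(batch)
if bitmask is not None:
    # Structured output is enabled: one int32 row per request in the batch.
    assert bitmask.shape == (len(batch), vocab_size)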

kv_manager

property kv_manager: PagedKVCacheManager

Returns the KV cache manager for this pipeline.

overlap_active

property overlap_active: bool

Whether CPU/GPU overlap is actually in effect.

When overlap is active, execute() defers synchronization of the current batch until the next call, so wall-clock time measured around execute() reflects the previous batch’s execution, not the current one.
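
A sketch of the timing caveat, assuming a constructed pipeline:

import time

start = time.perf_counter()
outputs = pipeline.execute(inputs)
elapsed = time.perf_counter() - start
if pipeline.overlap_active:
    # elapsed reflects the previous batch's execution, not this one's.
    pass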

pipeline_config

property pipeline_config: PipelineConfig

Returns the pipeline configuration.

release()

release(request_id)

Marks the context as complete, releasing the cache slot from the KV manager.

Note: Primary KV cache lifecycle is managed by the scheduler. This method handles extra KV caches managed by the pipeline model (e.g., indexer cache for DeepSeekV3.2).

Parameters:

request_id (RequestID)

Return type:

None
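
A sketch of the release pattern; how the caller learns that a request has finished is out of scope here, so finished_ids is assumed:

for request_id in finished_ids:
    # Frees extra pipeline-managed caches (e.g., the DeepSeekV3.2 indexer cache).
    pipeline.release(request_id)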

spec_decode_metrics()

spec_decode_metrics()

Returns the draft token acceptance metrics for speculative decoding.

Return type:

SpeculativeDecodingMetrics | None
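
For illustration, assuming a constructed pipeline:

metrics = pipeline.spec_decode_metrics()
if metrics is None:
    print("speculative decoding is not active")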

tokenizer

property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]

Returns the tokenizer used for building contexts and decoding.

update_for_structured_output()

update_for_structured_output(context, bitmask, index)

Updates the context and the logits bitmask for structured output.

If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context, then fills the per-request token bitmask used to constrain the next-token distribution.

Note: Unlike TextGenerationPipeline, this does NOT support jump-ahead tokens. Jump-ahead would require re-allocating KV cache blocks after the scheduler’s alloc() call and would break CUDA graph capture (which only supports active_length=1). The bitmask constraint alone ensures valid structured output.

Parameters:

  • context (TextGenerationContextType) – Request context to update.
  • bitmask (ndarray[tuple[Any, ...], dtype[int32]]) – Optional preallocated bitmask buffer; updated in-place.
  • index (int) – Global position into the bitmask for this request.

Raises:

ValueError – If a JSON schema is provided but structured output is not enabled via sampling configuration.

Return type:

None
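
A sketch combining this with initialize_bitmask(); using each request's batch position as the global bitmask index is an assumption for illustration:

bitmask = pipeline.initialize_bitmask(batch)
for i, context in enumerate(batch):
    pipeline.update_for_structured_output(context, bitmask, i)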

warmup_graph_capture()

warmup_graph_capture()

Initializes and runs overlap device graph capture warmup.

Return type:

None
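
A sketch of intended usage; calling this once at startup, before serving traffic, is an assumption:

pipeline.warmup_graph_capture()  # assumed: run once so graph capture stays off the hot path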