OverlapTextGenerationPipeline
final class max.pipelines.lib.OverlapTextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, disable_overlap=False)
Bases: TextGenerationPipelineInterface[TextGenerationContextType], Generic[TextGenerationContextType]
Overlap text generation pipeline.
Initialize a text generation pipeline instance.
This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.
Parameters:
- pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
- pipeline_model (type[PipelineModel[Any]]) – Concrete model implementation to use for execution.
- eos_token_id (int) – Default EOS token ID, used when the Hugging Face config does not supply one or to seed the EOS set.
- weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
- tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.
- disable_overlap (bool) – When set, the overlap scheduler synchronizes immediately after model execution, removing any potential CPU/GPU overlap.
Raises:
ValueError – If quantization_encoding is not configured in pipeline_config.model, or if structured output is requested without a valid tokenizer delegate.
draft_kv_blocks
Returns the draft KV cache block buffers, one per DP replica.
Returns None when speculative decoding is not active.
execute()
execute(inputs)
Executes a batch of requests asynchronously on the GPU.
This method returns before the outputs for the current batch are ready. The caller may need to call execute() again (possibly with an empty batch) to retrieve those outputs. For example:
output_a = pipeline.execute(inputs)
assert len(output_a) == 0
output_b = pipeline.execute(empty_inputs)
assert len(output_b) == len(inputs.flat_batch)
Parameters:
inputs (TextGenerationInputs[TextGenerationContextType]) – The inputs for the batch.
Returns:
A dictionary mapping request IDs to outputs. The outputs do not correspond to the requests in the input batch; they are the outputs of the previous batch.
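The one-batch delay in outputs can be sketched with a minimal stand-in class. MockOverlapPipeline is hypothetical and mimics only the documented contract, not the real implementation:

```python
# Hypothetical stand-in for OverlapTextGenerationPipeline, mimicking only the
# documented deferred-output contract of execute(); not the real implementation.
class MockOverlapPipeline:
    def __init__(self):
        self._in_flight = {}  # batch submitted but not yet synchronized

    def execute(self, request_ids):
        # Return outputs for the *previous* batch, then enqueue this one.
        previous = self._in_flight
        self._in_flight = {rid: f"output-for-{rid}" for rid in request_ids}
        return previous

pipeline = MockOverlapPipeline()
first = pipeline.execute(["req-0", "req-1"])  # nothing in flight yet -> {}
second = pipeline.execute([])                 # empty batch drains req-0/req-1
assert first == {}
assert sorted(second) == ["req-0", "req-1"]
```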
has_pending_outputs()
has_pending_outputs()
Returns True if there are pending outputs from the previous batch.
If this is True, the caller should call execute() again, even with empty inputs, to retrieve those outputs.
Return type:
bool
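A caller can drain the final batch by looping on has_pending_outputs(). The sketch below uses a hypothetical StubPipeline that mimics the documented behavior; the real pipeline runs batches on the GPU:

```python
# Hypothetical stub mimicking the documented execute() / has_pending_outputs()
# contract of the overlap pipeline; not the real implementation.
class StubPipeline:
    def __init__(self):
        self._in_flight = {}

    def execute(self, request_ids):
        previous = self._in_flight
        self._in_flight = {rid: f"output-for-{rid}" for rid in request_ids}
        return previous

    def has_pending_outputs(self):
        return bool(self._in_flight)

pipeline = StubPipeline()
outputs = dict(pipeline.execute(["req-0"]))
# Keep calling execute() with empty inputs until the last batch is drained.
while pipeline.has_pending_outputs():
    outputs.update(pipeline.execute([]))
assert outputs == {"req-0": "output-for-req-0"}
```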
initialize_bitmask()
initialize_bitmask(batch)
Allocates a per-request token bitmask for structured decoding.
kv_manager
property kv_manager: PagedKVCacheManager
Returns the KV cache manager for this pipeline.
overlap_active
property overlap_active: bool
Whether CPU/GPU overlap is actually in effect.
When overlap is active, execute() defers synchronization of the
current batch until the next call, so wall-clock time measured around
execute() reflects the previous batch’s execution, not the
current one.
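One practical consequence: when benchmarking, per-call wall-clock times must be shifted back by one batch while overlap is active. A small helper sketch (attribute_latencies is illustrative, not part of the API):

```python
def attribute_latencies(measured, overlap_active):
    """Map wall-clock times measured around execute() calls to batch indices.

    With overlap active, the time around call i reflects batch i - 1, so the
    first measurement is warm-up and the rest shift back by one batch.
    (Illustrative helper; not part of the max API.)
    """
    if not overlap_active:
        return {i: t for i, t in enumerate(measured)}
    return {i - 1: t for i, t in enumerate(measured) if i > 0}

# Three execute() calls; with overlap, call 1's time belongs to batch 0, etc.
assert attribute_latencies([5.0, 2.0, 3.0], overlap_active=True) == {0: 2.0, 1: 3.0}
assert attribute_latencies([5.0, 2.0], overlap_active=False) == {0: 5.0, 1: 2.0}
```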
pipeline_config
property pipeline_config: PipelineConfig
Returns the pipeline configuration.
release()
release(request_id)
Mark the context as complete, releasing the cache slot from the KV manager.
Note: Primary KV cache lifecycle is managed by the scheduler. This method handles extra KV caches managed by the pipeline model (e.g., indexer cache for DeepSeekV3.2).
Parameters:
request_id (RequestID)
Return type:
None
spec_decode_metrics()
spec_decode_metrics()
Returns the draft token acceptance metrics for speculative decoding.
Return type:
SpeculativeDecodingMetrics | None
tokenizer
property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]
Returns the tokenizer used for building contexts and decoding.
update_for_structured_output()
update_for_structured_output(context, bitmask, index)
Update context and logits bitmask for structured output.
If a json_schema is present and no matcher is set, this compiles a
grammar matcher and installs it on the context, then fills the per-request
token bitmask used to constrain the next-token distribution.
Note: Unlike TextGenerationPipeline, this does NOT support jump-ahead tokens. Jump-ahead would require re-allocating KV cache blocks after the scheduler’s alloc() call and would break CUDA graph capture (which only supports active_length=1). The bitmask constraint alone ensures valid structured output.
Raises:
ValueError – If a JSON schema is provided but structured output is not enabled via the sampling configuration.
Return type:
None
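The bitmask constraint itself can be illustrated in plain Python. This is a conceptual sketch assuming the common one-bit-per-token packing into 32-bit words; in the real pipeline the mask is filled by the compiled grammar matcher, not by hand:

```python
import math

def fill_token_bitmask(allowed_token_ids, vocab_size):
    # Pack one bit per vocabulary token into 32-bit words, as is common
    # for structured-decoding bitmasks. (Conceptual sketch only.)
    words = [0] * math.ceil(vocab_size / 32)
    for tok in allowed_token_ids:
        words[tok // 32] |= 1 << (tok % 32)
    return words

def apply_bitmask(logits, words):
    # Disallowed tokens get -inf so they can never be sampled.
    return [
        x if (words[i // 32] >> (i % 32)) & 1 else float("-inf")
        for i, x in enumerate(logits)
    ]

mask = fill_token_bitmask({0, 2}, vocab_size=4)
masked = apply_bitmask([1.0, 2.0, 3.0, 4.0], mask)
assert masked == [1.0, float("-inf"), 3.0, float("-inf")]
```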
warmup_graph_capture()
warmup_graph_capture()
Initializes and runs overlap device graph capture warmup.
Return type:
None