OverlapTextGenerationPipeline
final class max.pipelines.lib.OverlapTextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, disable_overlap=False)
Bases: TextGenerationPipelineInterface[TextGenerationContextType], Generic[TextGenerationContextType]
Overlap text generation pipeline.
Initialize a text generation pipeline instance.
This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.
Parameters:

- pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
- pipeline_model (type[PipelineModel[Any]]) – Concrete model implementation to use for execution.
- eos_token_id (int) – Default EOS token ID used when the HF config does not supply one, or to seed the EOS set.
- weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
- tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.
- disable_overlap (bool) – When set, the overlap scheduler synchronizes immediately after model execution, removing any potential CPU/GPU overlap.
Raises:

ValueError – If quantization_encoding is not configured in pipeline_config.model, or if structured output is requested without a valid tokenizer delegate.
execute()
execute(inputs)
Executes a batch of requests asynchronously on the GPU.

This method returns before the outputs for the current batch are ready. The caller may need to call execute() again (possibly with an empty batch) to retrieve these outputs. For example:

output_a = pipeline.execute(inputs)
assert len(output_a) == 0
output_b = pipeline.execute(empty_inputs)
assert len(output_b) == len(inputs.flat_batch)
Parameters:

inputs (TextGenerationInputs[TextGenerationContextType]) – The inputs for the batch.

Returns:

A dictionary of request IDs to outputs. The outputs do not correspond to the requests in the input batch; instead, they are from the previous batch.
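The one-batch-delay contract described above can be illustrated with a minimal toy stand-in. This is not the real max.pipelines.lib class; the name ToyOverlapPipeline and its internals are hypothetical, and the "asynchronous" launch is simulated by withholding results for one call:

```python
# Toy stand-in for the overlap pipeline's output-delay contract.
# NOT the real max.pipelines.lib class: execute() here buffers the
# current batch's results and returns the PREVIOUS batch's outputs.
class ToyOverlapPipeline:
    def __init__(self):
        # Outputs for the batch submitted on the last execute() call.
        self._pending: dict[str, str] = {}

    def execute(self, inputs: list[str]) -> dict[str, str]:
        # Hand back outputs from the previous batch ...
        previous = self._pending
        # ... and "launch" the current batch (simulated: results are
        # computed now but only surfaced on the next call).
        self._pending = {req_id: f"output-for-{req_id}" for req_id in inputs}
        return previous

    def has_pending_outputs(self) -> bool:
        return bool(self._pending)


pipeline = ToyOverlapPipeline()
first = pipeline.execute(["req-1", "req-2"])  # nothing ready yet
second = pipeline.execute([])                 # empty batch retrieves them
```

Here `first` is empty and `second` holds the outputs for `req-1` and `req-2`, mirroring the delayed-retrieval behavior the real pipeline documents.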
has_pending_outputs()
has_pending_outputs()
Returns True if there are pending outputs for the previous batch.

If this is True, the caller should call .execute() again, even with empty inputs, to retrieve the outputs for the previous batch.

Return type:

bool
kv_manager
property kv_manager: PagedKVCacheManager
Returns the KV cache manager for this pipeline.
pipeline_config
property pipeline_config: PipelineConfig
Return the pipeline configuration.
release()
release(request_id)
Mark the context as complete, releasing the cache slot from the KV manager.
Note: Primary KV cache lifecycle is managed by the scheduler. This method handles extra KV caches managed by the pipeline model (e.g., indexer cache for DeepSeekV3.2).
Parameters:

request_id (RequestID)

Return type:

None
spec_decode_metrics()
spec_decode_metrics()
Returns the draft token acceptance metrics for speculative decoding.
Return type:

SpeculativeDecodingMetrics | None
tokenizer
property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[Any, ...], dtype[integer[Any]]], TextGenerationRequest]
Return the tokenizer used for building contexts and decoding.
warmup_graph_capture()
warmup_graph_capture()
Initializes and runs overlap device graph capture warmup.
Return type:

None