Python class
StandaloneSpeculativeDecodingPipeline
final class max.pipelines.lib.StandaloneSpeculativeDecodingPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, draft_pipeline_model=None, draft_weight_adapters=None)
Bases: SpeculativeDecodingPipelineBase
Standalone speculative decoding, in which the draft model runs independently of the target model.
In this approach, the draft model generates tokens without any information from the target model; the target model then verifies those tokens.
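The draft-then-verify flow described above can be illustrated with a small, self-contained sketch. It is not the MAX implementation: the names `speculative_step`, `toy_draft`, `toy_target`, and `generate` are hypothetical, the "models" are plain functions that return a greedy next token, and verification uses exact matching rather than the sampling-based check this pipeline applies.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    num_steps: int,
) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes num_steps tokens using only its own outputs.
    draft_tokens: List[int] = []
    for _ in range(num_steps):
        draft_tokens.append(draft_next(prefix + draft_tokens))

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for token in draft_tokens:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(prefix: List[int]) -> int:
    """Toy target model (assumption: any deterministic rule works here)."""
    return (sum(prefix) + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    """Toy draft model that disagrees when the prefix length is a multiple of 4."""
    guess = (sum(prefix) + 1) % 7
    return guess if len(prefix) % 4 else (guess + 1) % 7


def generate(prefix: List[int], max_new_tokens: int, num_steps: int) -> List[int]:
    """Repeat draft-then-verify rounds until max_new_tokens have been added."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, toy_draft, toy_target, num_steps))
    return out[: len(prefix) + max_new_tokens]
```

With greedy matching, the output is identical to what the target model would produce on its own; the draft model only changes how many target tokens are confirmed per round. A real pipeline would typically score all draft positions in a single target forward pass rather than calling the target once per token as this sketch does.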
Parameters:
- pipeline_config (PipelineConfig)
- pipeline_model (type[PipelineModel[TextContext]])
- eos_token_id (int)
- weight_adapters (dict[WeightsFormat, WeightsAdapter])
- tokenizer (PipelineTokenizer[TextContext, npt.NDArray[np.integer[Any]], TextGenerationRequest])
- draft_pipeline_model (type[PipelineModel[TextContext]] | None)
- draft_weight_adapters (dict[WeightsFormat, WeightsAdapter] | None)
execute()
execute(inputs)
Executes standalone speculative decoding.
In standalone mode:
- The draft model generates tokens independently.
- The target model verifies the draft tokens.
- Rejection sampling is applied to accept or reject the draft tokens.
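The rejection sampling step can be sketched for a single draft token as follows. This page does not spell out the exact acceptance rule, so the sketch assumes the rule commonly used in speculative sampling: accept the draft token with probability min(1, p/q), where q and p are the draft and target probabilities of that token, and otherwise resample from the normalized residual max(0, p - q). The function name `accept_or_resample` and its arguments are hypothetical and not part of the MAX API.

```python
import random
from typing import Sequence, Tuple


def accept_or_resample(
    token: int,
    draft_probs: Sequence[float],
    target_probs: Sequence[float],
    rng: random.Random,
) -> Tuple[bool, int]:
    """Apply the assumed acceptance test to one draft token.

    Returns (accepted, token_to_emit). draft_probs[token] must be positive,
    which holds whenever the token was sampled from draft_probs.
    """
    q = draft_probs[token]
    p = target_probs[token]

    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token

    # Rejected: resample from the normalized residual max(0, p - q).
    # A rejection implies p < q for this token, so the residual mass is positive.
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    total = sum(residual)
    weights = [r / total for r in residual]
    new_token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return False, new_token
```

Under this rule, a draft token that the target model rates at least as likely as the draft model did is always kept, and a rejection replaces it with a token drawn only from positions where the target assigns more probability than the draft.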
Parameters:
- inputs (TextGenerationInputs[TextContext])
Return type:
generate_draft_tokens()
generate_draft_tokens(batch, num_steps, model_inputs)
Generates draft tokens for the batch using the draft model.
Parameters:
- batch (list[TextContext])
- num_steps (int)
- model_inputs (ModelInputs)
Return type:
prepare_batch()
prepare_batch(model, batch, replica_batches, return_n_logits, is_draft=False, draft_inputs=None, merged_draft_tokens=None, merged_draft_offsets=None)
Prepares the batch inputs and KV cache for the draft or target model.
Parameters:
- model (PipelineModel[TextContext])
- batch (list[TextContext])
- replica_batches (list[list[TextContext]])
- return_n_logits (int)
- is_draft (bool)
- draft_inputs (ModelInputs | None)
- merged_draft_tokens (Buffer | None)
- merged_draft_offsets (Buffer | None)
Return type:
spec_decode_metrics()
spec_decode_metrics()
Returns the draft token acceptance metrics for speculative decoding.
Return type:
SpeculativeDecodingMetrics
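As a rough illustration of what draft token acceptance metrics measure, the sketch below tracks how many draft tokens were proposed and how many were accepted, then derives an acceptance rate. The class `ToyAcceptanceStats` and its fields are hypothetical; they are not the fields of `SpeculativeDecodingMetrics`, which this page does not list.

```python
from dataclasses import dataclass


@dataclass
class ToyAcceptanceStats:
    """Hypothetical counters; not the fields of SpeculativeDecodingMetrics."""

    draft_tokens_generated: int = 0
    draft_tokens_accepted: int = 0

    def update(self, generated: int, accepted: int) -> None:
        """Record one verification round."""
        self.draft_tokens_generated += generated
        self.draft_tokens_accepted += accepted

    @property
    def acceptance_rate(self) -> float:
        """Fraction of proposed draft tokens that were accepted."""
        if self.draft_tokens_generated == 0:
            return 0.0
        return self.draft_tokens_accepted / self.draft_tokens_generated
```

A higher acceptance rate means more draft tokens survive verification per round, which is what makes speculative decoding worthwhile.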
verify_draft_tokens_with_target_model()
verify_draft_tokens_with_target_model(draft_inputs, context_batch, replica_batches, num_draft_tokens_generated, draft_tokens, draft_logits, merged_draft_tokens, merged_draft_offsets, all_draft_logits)
Verifies draft tokens against the target model and returns merged outputs.
Parameters:
- draft_inputs (ModelInputs)
- context_batch (list[TextContext])
- replica_batches (list[list[TextContext]])
- num_draft_tokens_generated (int)
- draft_tokens (Buffer)
- draft_logits (Buffer)
- merged_draft_tokens (Buffer)
- merged_draft_offsets (Buffer)
- all_draft_logits (Buffer | None)
Return type: