Python module
max.pipelines.architectures.unified_mtp_deepseekV3
DeepSeek-V3 multi-token prediction draft model for speculative decoding with unified graph compilation.
UnifiedMTPDeepseekV3Model
class max.pipelines.architectures.unified_mtp_deepseekV3.UnifiedMTPDeepseekV3Model(*args, **kwargs)
Bases: DeepseekV3Model
DeepseekV3 with multi-token prediction (MTP): combines the merge, target, rejection, and shift steps in one graph.
execute()
execute(model_inputs)
Executes the graph and returns all three graph outputs for speculative decoding.
Parameters:
- model_inputs (ModelInputs)
Return type:
UnifiedEagleOutputs
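For readers unfamiliar with the rejection step in speculative decoding, the sketch below shows greedy verification, one common variant. It is an illustration under stated assumptions, not this model's graph: the function name `greedy_verify` and its list-based inputs are hypothetical, and the actual rejection logic used here may differ (for example, it may be sampling-based).

```python
def greedy_verify(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens emitted by one speculative step (hypothetical helper).

    draft_tokens: k tokens proposed by the draft model.
    target_tokens: k + 1 greedy tokens from the target model, where
        target_tokens[i] is the target's choice at the position of draft i
        and the last entry is the bonus token that follows all drafts.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted: list[int] = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First mismatch: keep the target's token and reject the rest.
            accepted.append(target)
            return accepted
        accepted.append(draft)

    # Every draft matched, so the bonus token is emitted as well.
    accepted.append(target_tokens[-1])
    return accepted
```

Each step emits at least one token, because the target's own choice is kept even when the first draft is rejected.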
load_model()
load_model(session)
Load the model with the given weights.
Parameters:
- session (InferenceSession)
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs to be passed to execute().
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and kv_cache_inputs (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
Parameters:
- replica_batches (Sequence[Sequence[TextContext]])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- return_n_logits (int)
Return type:
UnifiedMTPDeepseekV3Inputs
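The batching described above can be pictured as flattening nested per-replica batches into one token buffer plus row offsets. This is a minimal sketch under assumptions: plain lists of token IDs stand in for TextContext objects, the ragged layout is assumed rather than confirmed for this model, and `flatten_batches` is a hypothetical helper, not part of the library.

```python
from collections.abc import Sequence


def flatten_batches(
    replica_batches: Sequence[Sequence[Sequence[int]]],
) -> tuple[list[int], list[int]]:
    """Flatten nested batches into (tokens, row_offsets).

    row_offsets[i] and row_offsets[i + 1] bound the tokens of sequence i
    in the flat token list (an assumed ragged layout).
    """
    tokens: list[int] = []
    row_offsets: list[int] = [0]
    for batch in replica_batches:
        for sequence in batch:
            tokens.extend(sequence)
            # Record where this sequence ends in the flat buffer.
            row_offsets.append(len(tokens))
    return tokens, row_offsets
```

With two replicas holding sequences of length 3, 1, and 2, the offsets mark the boundaries 0, 3, 4, and 6 in the flat buffer.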
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the secondary inputs to be passed to execute().
While prepare_initial_token_inputs manages the initial inputs, this method updates the inputs for each subsequent step in a multi-step execution pattern.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
Return type:
UnifiedMTPDeepseekV3Inputs
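The multi-step pattern connecting the three methods can be sketched with a toy stand-in. Everything here is hypothetical: `ToyModel`, `ToyInputs`, and `run_steps` only mirror the method names documented on this page, use simplified signatures, and do not touch the real MAX classes, buffers, or KV cache.

```python
from dataclasses import dataclass


@dataclass
class ToyInputs:
    """Hypothetical stand-in for the model inputs object."""
    tokens: list[int]
    step: int = 0


class ToyModel:
    """Hypothetical stand-in that mirrors only the documented method names."""

    def prepare_initial_token_inputs(self, prompt: list[int]) -> ToyInputs:
        # Simplified signature: the real method takes replica batches.
        return ToyInputs(tokens=list(prompt))

    def execute(self, model_inputs: ToyInputs) -> int:
        # Fake "model": the next token is the last input token plus one.
        return model_inputs.tokens[-1] + 1

    def prepare_next_token_inputs(
        self, next_tokens: int, prev_model_inputs: ToyInputs
    ) -> ToyInputs:
        # Later steps feed only the newly produced token back in.
        return ToyInputs(tokens=[next_tokens], step=prev_model_inputs.step + 1)


def run_steps(model: ToyModel, prompt: list[int], num_steps: int) -> list[int]:
    """Prepare inputs once, then alternate execute and prepare_next."""
    inputs = model.prepare_initial_token_inputs(prompt)
    generated: list[int] = []
    for _ in range(num_steps):
        next_token = model.execute(inputs)
        generated.append(next_token)
        inputs = model.prepare_next_token_inputs(next_token, inputs)
    return generated
```

The initial preparation runs once per request, while the next-token preparation runs once per step and receives the previous inputs so that state can be carried forward.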