Python module

max.pipelines.architectures.unified_mtp_deepseekV3

DeepSeek-V3 multi-token prediction draft model for speculative decoding with unified graph compilation.

UnifiedMTPDeepseekV3Model

class max.pipelines.architectures.unified_mtp_deepseekV3.UnifiedMTPDeepseekV3Model(*args, **kwargs)

Bases: DeepseekV3Model

DeepseekV3 with MTP: merge + target + rejection + shift in one graph.
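The "rejection" stage is not specified on this page. As a rough illustration only, one common scheme in speculative decoding is greedy verification: draft tokens are kept while they match the target model's own predictions, and the first mismatch is replaced by the target's token. The function name greedy_verify and the list-based inputs below are hypothetical and are not part of the MAX API.

```python
def greedy_verify(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Illustrative greedy verification (an assumption, not the MAX implementation).

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 tokens predicted by the target model, where entry i
                   is the target's choice given the prefix plus the first i
                   draft tokens.
    Returns the accepted draft prefix followed by one target token.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first mismatch: reject this draft token and everything after it
        accepted += 1

    # The target token at the stop position is always emitted: it is either the
    # correction for the rejected draft token or a bonus token if all matched.
    return draft_tokens[:accepted] + [target_tokens[accepted]]
```

With this scheme every verification step emits at least one token, so decoding always makes progress even when the first draft token is rejected.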

execute()

execute(model_inputs)

Execute and return all 3 graph outputs for speculative decoding.

Parameters:

model_inputs (ModelInputs)

Return type:

UnifiedEagleOutputs

load_model()

load_model(session)

Load the model with the given weights.

Parameters:

session (InferenceSession)

Return type:

Model

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepares the initial inputs to be passed to execute().

The inputs and functionality can vary per model. For example, model inputs could include encoded tensors, unique IDs per tensor when using a KV cache manager, and kv_cache_inputs (or None if the model does not use KV cache). This method typically batches encoded tensors, claims a KV cache slot if needed, and returns the inputs and caches.

Parameters:

replica_batches

kv_cache_inputs (default: None)

return_n_logits (default: 1)

Return type:

UnifiedMTPDeepseekV3Inputs
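The description above mentions batching encoded tensors without saying how. As a minimal sketch, and purely as an assumption about what such batching can look like, variable-length token sequences are often packed into one flat buffer plus a list of row offsets. The helper batch_ragged is hypothetical and does not come from the MAX API.

```python
def batch_ragged(sequences: list[list[int]]) -> tuple[list[int], list[int]]:
    """Pack variable-length sequences into a flat buffer with row offsets.

    Sequence i occupies flat[row_offsets[i]:row_offsets[i + 1]].
    This is an illustrative layout, not the layout used by the model.
    """
    flat: list[int] = []
    row_offsets = [0]
    for seq in sequences:
        flat.extend(seq)
        row_offsets.append(len(flat))  # end of this row, start of the next
    return flat, row_offsets
```

A packed layout like this avoids padding every sequence to the longest one in the batch, at the cost of carrying the offsets alongside the data.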

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepares the secondary inputs to be passed to execute().

While prepare_initial_token_inputs is responsible for managing the initial inputs, this function is responsible for updating the inputs for each step in a multi-step execution pattern.

Parameters:

next_tokens

prev_model_inputs

Return type:

UnifiedMTPDeepseekV3Inputs
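The multi-step pattern described for prepare_initial_token_inputs, execute, and prepare_next_token_inputs can be sketched as a simple driver loop. The sketch below uses a toy stand-in class so that it runs on its own. ToyModel, ToyInputs, run_steps, the nested-list shape of replica_batches, and the "last token plus one" prediction are all assumptions made for illustration; only the three method names and their parameter names are taken from this page.

```python
from dataclasses import dataclass


@dataclass
class ToyInputs:
    """Hypothetical stand-in for UnifiedMTPDeepseekV3Inputs."""
    tokens: list[int]
    step: int


class ToyModel:
    """Hypothetical stand-in that mirrors the documented method names only."""

    def prepare_initial_token_inputs(self, replica_batches, kv_cache_inputs=None, return_n_logits=1):
        # Assumed shape: replicas -> sequences -> token IDs; flatten everything.
        flat = [tok for batch in replica_batches for seq in batch for tok in seq]
        return ToyInputs(tokens=flat, step=0)

    def execute(self, model_inputs):
        # Toy "prediction": the last input token plus one.
        return [model_inputs.tokens[-1] + 1]

    def prepare_next_token_inputs(self, next_tokens, prev_model_inputs):
        # Carry state forward from the previous step's inputs.
        return ToyInputs(tokens=list(next_tokens), step=prev_model_inputs.step + 1)


def run_steps(model, replica_batches, num_steps):
    """Build initial inputs once, then update them after every execute()."""
    inputs = model.prepare_initial_token_inputs(replica_batches)
    generated = []
    for step in range(num_steps):
        next_tokens = model.execute(inputs)
        generated.extend(next_tokens)
        if step + 1 < num_steps:
            inputs = model.prepare_next_token_inputs(next_tokens, inputs)
    return generated
```

The point of the split is that the initial preparation runs once per request, while the per-step update reuses the previous inputs instead of rebuilding them.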