Python module

max.pipelines.architectures.unified_mtp_deepseekV3

DeepSeek-V3 multi-token prediction draft model for speculative decoding with unified graph compilation.

UnifiedMTPDeepseekV3Inputs​

class max.pipelines.architectures.unified_mtp_deepseekV3.UnifiedMTPDeepseekV3Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, draft_tokens=None, draft_kv_blocks=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())

source

Bases: DeepseekV3Inputs

Inputs for the UnifiedMTPDeepseekV3 model.

Parameters:

buffers​

property buffers: tuple[Buffer, ...]

source

Returns positional Buffer inputs for model ABI calls.

draft_kv_blocks​

draft_kv_blocks: list[Buffer] | None = None

source

draft_tokens​

draft_tokens: Buffer | None = None

source

in_thinking_phase​

in_thinking_phase: Buffer | None = None

source

Per-batch bool flag marking rows currently inside a <think>...</think> block; consumed by relaxed acceptance.
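The relaxed-acceptance kernel itself lives in the compiled graph, but the gating idea can be sketched with hypothetical helper names: a strict rule requires an exact match with the target token, while rows flagged as in-thinking may also accept a draft token from the target's top-k set.

```python
def accept_draft(target_token: int, draft_token: int,
                 target_top_k: list[int], in_thinking: bool) -> bool:
    """Toy acceptance rule (not the MAX kernel): exact match always
    accepts; rows inside a <think>...</think> block additionally
    accept any draft token in the target's top-k set."""
    if draft_token == target_token:
        return True
    return in_thinking and draft_token in target_top_k

# Outside a thinking block, only an exact match is accepted.
assert not accept_draft(7, 3, [7, 3, 5], in_thinking=False)
# Inside a thinking block, any top-k member is accepted.
assert accept_draft(7, 3, [7, 3, 5], in_thinking=True)
```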

max_k​

max_k: Buffer | None = None

source

min_top_p​

min_top_p: Buffer | None = None

source

Per-batch sampling parameters consumed by the stochastic acceptance sampler. max_k and min_top_p are 0-d CPU scalars; the rest are [batch_size] tensors on the primary device.
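The names suggest `max_k` and `min_top_p` are batch-level reductions of the per-row tensors, letting the kernel size its top-k workspace and nucleus threshold once for the whole batch. A minimal sketch of that reading, with hypothetical per-row values:

```python
# Per-row sampling parameters for a batch of 3 requests (illustrative values).
top_k = [40, 10, 64]
top_p = [0.95, 0.9, 1.0]

# Plausible reading of the 0-d CPU scalars: max_k bounds the top-k
# workspace across rows; min_top_p is the tightest nucleus threshold.
max_k = max(top_k)
min_top_p = min(top_p)

assert max_k == 64
assert min_top_p == 0.9
```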

seed​

seed: Buffer | None = None

source

Per-execute int64 scalar seed consumed by the stochastic acceptance sampler (and, when enabled, the synthetic benchmarking sampler).

temperature​

temperature: Buffer | None = None

source

top_k​

top_k: Buffer | None = None

source

top_p​

top_p: Buffer | None = None

source

UnifiedMTPDeepseekV3Model​

class max.pipelines.architectures.unified_mtp_deepseekV3.UnifiedMTPDeepseekV3Model(*args, **kwargs)

source

Bases: DeepseekV3Model

DeepseekV3 with MTP: merge + target + rejection + shift in one graph.
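The "rejection" stage refers to the standard speculative-decoding acceptance test; the sketch below shows that textbook rule rather than this module's fused graph kernel: a draft token is kept with probability min(1, p_target / p_draft).

```python
import random

def rejection_sample_step(p_target: float, p_draft: float,
                          rng: random.Random) -> bool:
    """Standard speculative-decoding acceptance test (not the MAX
    kernel): accept the draft token with probability
    min(1, p_target / p_draft)."""
    return rng.random() < min(1.0, p_target / p_draft)

rng = random.Random(0)
# The target assigns at least as much probability as the draft:
# the draft token is always accepted.
assert rejection_sample_step(0.5, 0.25, rng)
# The target assigns zero probability: the draft is always rejected.
assert not rejection_sample_step(0.0, 0.5, rng)
```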

execute()​

execute(model_inputs)

source

Execute and return all 3 graph outputs for speculative decoding.

Parameters:

model_inputs (ModelInputs)

Return type:

UnifiedEagleOutputs

load_model()​

load_model(session)

source

Load the model with the given weights.

Parameters:

session (InferenceSession)

Return type:

Model

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, draft_tokens=None, draft_kv_cache_buffers=None, **kwargs)

source

Prepares the initial inputs to be passed to execute().

The inputs and functionality can vary per model. For example, model inputs could include encoded tensors, unique IDs per tensor when using a KV cache manager, and kv_cache_inputs (or None if the model does not use KV cache). This method typically batches encoded tensors, claims a KV cache slot if needed, and returns the inputs and caches.

Parameters:

Return type:

UnifiedMTPDeepseekV3Inputs

prepare_next_token_inputs()​

prepare_next_token_inputs(next_tokens, prev_model_inputs)

source

Prepares the secondary inputs to be passed to execute().

While prepare_initial_token_inputs is responsible for managing the initial inputs, this function is responsible for updating the inputs for each step in a multi-step execution pattern.

Parameters:

Return type:

UnifiedMTPDeepseekV3Inputs
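The division of labor between the two prepare methods follows the usual multi-step driver pattern; a schematic sketch with stand-in types (not the MAX API) illustrates it:

```python
class ToyModel:
    """Stand-in illustrating the prepare-initial / prepare-next split."""

    def prepare_initial_token_inputs(self, batch):
        # Built once per request batch: full inputs, step counter at 0.
        return {"tokens": batch, "step": 0}

    def prepare_next_token_inputs(self, next_tokens, prev_inputs):
        # Only the fields that change between steps are rebuilt; the
        # rest of prev_inputs (cache handles, offsets) carries over.
        return {**prev_inputs, "tokens": next_tokens,
                "step": prev_inputs["step"] + 1}

m = ToyModel()
inputs = m.prepare_initial_token_inputs([1, 2, 3])
for _ in range(2):
    inputs = m.prepare_next_token_inputs([4], inputs)
assert inputs["step"] == 2 and inputs["tokens"] == [4]
```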