
Python module

max.pipelines.architectures.unified_mtp_deepseekV3

DeepSeek-V3 multi-token prediction draft model for speculative decoding with unified graph compilation.

UnifiedMTPDeepseekV3Inputs

class max.pipelines.architectures.unified_mtp_deepseekV3.UnifiedMTPDeepseekV3Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, draft_tokens=None, draft_kv_blocks=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, pinned_bitmask=None, wait_payload=None, device_bitmask_scratch=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())

Bases: DeepseekV3Inputs

Inputs for the UnifiedMTPDeepseekV3 model.

buffers

property buffers: tuple[Buffer, ...]

Returns positional Buffer inputs for model ABI calls.

device_bitmask_scratch

device_bitmask_scratch: Buffer | None = None

Device scratch buffer that receives the in-graph host-to-device (H2D) copy of pinned_bitmask; the acceptance sampler reads the mask from this buffer. Only set when structured output is enabled.

draft_kv_blocks

draft_kv_blocks: list[Buffer] | None = None

draft_tokens

draft_tokens: Buffer | None = None

in_thinking_phase

in_thinking_phase: Buffer | None = None

Boolean flag per batch row, set for rows that are currently inside a <think>...</think> block; consumed by relaxed acceptance.
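To illustrate what such a flag encodes, the sketch below derives one boolean per batch row from token histories. It is only an assumption about how the state could be computed: the token IDs THINK_OPEN and THINK_CLOSE and the helper in_thinking_phase_flags are hypothetical, and the real pipeline may track this state differently and stores it in a Buffer rather than a NumPy array.

```python
import numpy as np

# Hypothetical token IDs for "<think>" and "</think>"; real IDs depend on the tokenizer.
THINK_OPEN = 1001
THINK_CLOSE = 1002


def in_thinking_phase_flags(rows: list[list[int]]) -> np.ndarray:
    """Return one bool per batch row: True if the row has an unclosed <think>."""
    flags = np.zeros(len(rows), dtype=np.bool_)
    for i, tokens in enumerate(rows):
        open_block = False
        for tok in tokens:
            if tok == THINK_OPEN:
                open_block = True
            elif tok == THINK_CLOSE:
                open_block = False
        flags[i] = open_block
    return flags


batch = [
    [5, THINK_OPEN, 7, 8],               # still inside the block
    [5, THINK_OPEN, 7, THINK_CLOSE, 9],  # block already closed
    [5, 6],                              # block never opened
]
flags = in_thinking_phase_flags(batch)
```

Only the first row is flagged here, because it is the only one whose most recent marker is an opening tag.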

max_k

max_k: Buffer | None = None

min_top_p

min_top_p: Buffer | None = None

pinned_bitmask

pinned_bitmask: Buffer | None = None

Pinned host bitmask for constrained decoding.

Shape [batch_size, num_speculative_tokens + 1, vocab_size]. Position i holds the valid-token mask for the FSM state reached after consuming the first i draft tokens (draft[0] through draft[i-1]), so position 0 is the mask before any draft token is consumed and position num_speculative_tokens is the mask for the bonus token. None when structured output is disabled.
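The layout described above can be sketched with a toy finite-state machine. Everything in this example is an assumption made for illustration: the TRANSITIONS table, the build_bitmask helper, the four-token vocabulary, the boolean dtype, and the choice to leave later positions all-False once a draft token leaves the grammar. The actual buffer is pinned host memory and may use a different element type or encoding.

```python
import numpy as np

VOCAB_SIZE = 4

# Toy FSM (hypothetical): state -> {allowed token -> next state}.
TRANSITIONS = {
    0: {0: 1, 1: 1},
    1: {2: 2},
    2: {3: 0},
}


def build_bitmask(drafts: np.ndarray, start_states: list[int]) -> np.ndarray:
    """Build a [batch, num_spec + 1, vocab] mask of tokens the FSM allows."""
    batch_size, num_spec = drafts.shape
    mask = np.zeros((batch_size, num_spec + 1, VOCAB_SIZE), dtype=np.bool_)
    for b in range(batch_size):
        state = start_states[b]
        for i in range(num_spec + 1):
            if state is None:
                break  # a draft token left the grammar; later positions stay all-False
            # Position i: tokens valid after consuming the first i draft tokens.
            mask[b, i, list(TRANSITIONS[state])] = True
            if i < num_spec:
                state = TRANSITIONS[state].get(int(drafts[b, i]))
    return mask


# Row 0 follows the grammar; row 1's second draft token (3) is not allowed in state 1.
drafts = np.array([[0, 2], [1, 3]])
mask = build_bitmask(drafts, start_states=[0, 0])
```

The last position of each row (index num_speculative_tokens) is the mask that would apply to the bonus token.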

seed

seed: Buffer | None = None

temperature

temperature: Buffer | None = None

top_k

top_k: Buffer | None = None

top_p

top_p: Buffer | None = None

wait_payload

wait_payload: Buffer | None = None

CPU int64[2] payload, [flag._unsafe_ptr, 1], consumed by the in-graph mo.wait_host_value_with_dep op. Only set when structured output is enabled.
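The shape of this payload can be mimicked in plain NumPy. This is a stand-in only: a NumPy array plays the role of the host flag, its ctypes address plays the role of flag._unsafe_ptr, and the second element is simply the constant 1 from the description above, whose meaning to the op is not stated on this page.

```python
import numpy as np

# Stand-in for the host-side flag; the real flag object is not shown on this page.
flag = np.zeros(1, dtype=np.int64)

# Two int64 elements: the flag's host address, then the constant 1.
payload = np.array([flag.ctypes.data, 1], dtype=np.int64)
```

The flag array must stay alive for as long as anything holds its address, since the payload stores a raw pointer rather than a reference.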

UnifiedMTPDeepseekV3Model

class max.pipelines.architectures.unified_mtp_deepseekV3.UnifiedMTPDeepseekV3Model(*args, **kwargs)

Bases: DeepseekV3Model

DeepSeek-V3 with multi-token prediction (MTP): the merge, target, rejection, and shift steps run in a single graph.
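For readers new to speculative decoding, the rejection step can be sketched as a greedy verification pass in NumPy. This is an illustrative simplification and not this model's implementation: the function verify_greedy is hypothetical, it compares drafts against the target's argmax rather than sampling with temperature, top_k, or top_p, and the merge and shift steps are not shown.

```python
import numpy as np


def verify_greedy(draft_tokens: np.ndarray, target_logits: np.ndarray):
    """Greedy draft verification.

    draft_tokens:  [batch, k] proposed tokens.
    target_logits: [batch, k + 1, vocab] target-model logits, one position per
                   draft token plus one extra position for the bonus token.
    Returns (num_accepted [batch], next_token [batch]).
    """
    target_tokens = target_logits.argmax(axis=-1)        # [batch, k + 1]
    matches = draft_tokens == target_tokens[:, :-1]      # [batch, k]
    # Length of the leading run of matches: cumprod zeroes everything
    # after the first mismatch.
    num_accepted = np.cumprod(matches.astype(np.int64), axis=1).sum(axis=1)
    # The token after the accepted prefix comes from the target model: a
    # correction on mismatch, or the bonus token if every draft was accepted.
    rows = np.arange(draft_tokens.shape[0])
    next_token = target_tokens[rows, num_accepted]
    return num_accepted, next_token


# Build logits whose argmax is a known token sequence (vocabulary of 5).
target_choice = np.array([[1, 2, 3, 4], [1, 0, 3, 4]])
target_logits = np.eye(5)[target_choice]                 # [2, 4, 5]
drafts = np.array([[1, 2, 3], [1, 2, 3]])

num_accepted, next_token = verify_greedy(drafts, target_logits)
```

In the first row all three drafts match, so the bonus token is emitted; in the second row the draft diverges at position 1, so only one draft is kept and the target's token replaces the rejected one.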

execute()

execute(model_inputs)

Execute the graph and return all three graph outputs for speculative decoding.

Parameters:

model_inputs (ModelInputs)

Return type:

UnifiedEagleOutputs

load_model()

load_model(session)

Load the model with the given weights.

Parameters:

session (InferenceSession)

Return type:

Model

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, draft_tokens=None, draft_kv_cache_buffers=None, **kwargs)

Prepares the initial inputs to be passed to execute().

The inputs and functionality can vary per model. For example, model inputs could include encoded tensors, unique IDs per tensor when using a KV cache manager, and kv_cache_inputs (or None if the model does not use KV cache). This method typically batches encoded tensors, claims a KV cache slot if needed, and returns the inputs and caches.

Return type:

UnifiedMTPDeepseekV3Inputs