Python module

max.pipelines.architectures.deepseekV3_nextn

DeepSeek-V3 NextN multi-token prediction draft model for speculative decoding.

`DeepseekV3NextNConfig`

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNConfig(*, dtype, kv_params, devices, use_subgraphs=True, data_parallel_degree=1, vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, moe_layer_freq=1, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=8, topk_group=4, num_experts_per_tok=8, first_k_dense_replace=3, norm_topk_prob=True, hidden_act='silu', max_position_embeddings=4096, max_seq_len=163840, rms_norm_eps=1e-06, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, rope_interleave=True, scoring_func='sigmoid', attention_bias=False, attention_dropout=0.0, norm_dtype=bfloat16, gate_dtype=None, correction_bias_dtype=None, max_batch_context_length=131072, quant_config=None, dense_mlp_layers_without_quant=frozenset({}), ep_config=None, graph_mode='auto', return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, eagle_aux_hidden_state_layer_ids=None, eplb_profile_enabled=False)

Bases: DeepseekV3Config

Configuration for the DeepseekV3 NextN model.

The NextN (Next-N token prediction) model is a single-layer decoder. It takes both the input embeddings and the hidden states from a base model, concatenates them, and passes the result through a single decoder layer to predict the next token.
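The data flow described above can be sketched in plain Python. This is a toy illustration only, not the MAX implementation: the helper names `concat_inputs` and `toy_decoder_layer` are hypothetical, the vectors are tiny, and the stand-in "decoder layer" is a simple averaging step rather than the real attention and MoE block.

```python
# Toy sketch of the NextN input path; not the MAX implementation.
# All helper names are hypothetical and the sizes are tiny for readability.


def concat_inputs(embeddings, hidden_states):
    """Concatenate each token's embedding with the base model's hidden state."""
    if len(embeddings) != len(hidden_states):
        raise ValueError("need one hidden state per token embedding")
    return [emb + hid for emb, hid in zip(embeddings, hidden_states)]


def toy_decoder_layer(rows, hidden_size):
    """Stand-in for the single decoder layer.

    Folds each 2 * hidden_size row back down to hidden_size by averaging
    the embedding half with the hidden-state half.
    """
    return [
        [(row[i] + row[i + hidden_size]) / 2 for i in range(hidden_size)]
        for row in rows
    ]


hidden_size = 4  # the documented default is 7168
embeddings = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]     # two tokens
hidden_states = [[3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]]  # from the base model

combined = concat_inputs(embeddings, hidden_states)  # each row has 2 * hidden_size values
output = toy_decoder_layer(combined, hidden_size)    # each row has hidden_size values
```

The point of the sketch is the shape change: the concatenated rows are twice as wide as the hidden size, and the single layer brings them back to the hidden size before the next-token prediction.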

Parameters:

dtype (DType)
kv_params (KVCacheParamInterface)
devices (list[DeviceRef])
use_subgraphs (bool)
data_parallel_degree (int)
vocab_size (int)
hidden_size (int)
intermediate_size (int)
moe_intermediate_size (int)
moe_layer_freq (int)
num_hidden_layers (int)
num_attention_heads (int)
num_key_value_heads (int)
n_shared_experts (int)
n_routed_experts (int)
routed_scaling_factor (float)
kv_lora_rank (int)
q_lora_rank (int)
qk_rope_head_dim (int)
v_head_dim (int)
qk_nope_head_dim (int)
topk_method (str)
n_group (int)
topk_group (int)
num_experts_per_tok (int)
first_k_dense_replace (int)
norm_topk_prob (bool)
hidden_act (str)
max_position_embeddings (int)
max_seq_len (int)
rms_norm_eps (float)
tie_word_embeddings (bool)
rope_theta (float)
rope_scaling (dict[str, Any] | None)
rope_interleave (bool)
scoring_func (str)
attention_bias (bool)
attention_dropout (float)
norm_dtype (DType)
gate_dtype (DType | None)
correction_bias_dtype (DType | None)
max_batch_context_length (int)
quant_config (QuantConfig | None)
dense_mlp_layers_without_quant (frozenset[int])
ep_config (EPConfig | None)
graph_mode (str)
return_logits (ReturnLogits)
return_hidden_states (ReturnHiddenStates)
eagle_aux_hidden_state_layer_ids (list[int] | None)
eplb_profile_enabled (bool)

`construct_kv_params()`

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Get KV cache parameters for the NextN model.

The NextN model has only a single decoder layer, so we only need to cache one layer’s worth of KV pairs.

Parameters:

huggingface_config (AutoConfig)
pipeline_config (PipelineConfig)
devices (list[DeviceRef])
kv_cache_config (KVCacheConfig)
cache_dtype (DType)

Return type:

KVCacheParams
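To get a feel for what caching a single layer saves, here is a rough sizing sketch. It rests on assumptions that this page does not confirm: that the cache stores `kv_lora_rank + qk_rope_head_dim` values per token per layer (a latent-attention layout), and that the cache dtype is 2 bytes wide. The function name `cache_bytes_per_token` is hypothetical.

```python
# Back-of-the-envelope KV cache sizing; illustrative assumptions only.
# Assumes kv_lora_rank + qk_rope_head_dim cached values per token per layer
# and a 2-byte cache dtype such as bfloat16.

KV_LORA_RANK = 512      # documented default
QK_ROPE_HEAD_DIM = 64   # documented default
BYTES_PER_VALUE = 2     # assumption: bfloat16 cache


def cache_bytes_per_token(num_layers):
    """Bytes of cache needed for one token across num_layers layers."""
    per_layer = (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * BYTES_PER_VALUE
    return num_layers * per_layer


base_bytes = cache_bytes_per_token(61)  # documented num_hidden_layers default
nextn_bytes = cache_bytes_per_token(1)  # NextN caches a single layer

print(nextn_bytes, base_bytes)
```

Under these assumptions the draft model's per-token cache footprint is 1/61 of a full 61-layer cache, whatever the real per-layer width turns out to be.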

`get_num_layers()`

static get_num_layers(huggingface_config)

NextN only has a single decoder layer.

Parameters:

huggingface_config (AutoConfig)

Return type:

int

`DeepseekV3NextNInputs`

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNInputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, *, kv_cache_inputs=None, lora=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())

Bases: DeepseekV3Inputs

A class representing inputs for the DeepseekV3 NextN model.

Inherits from DeepseekV3Inputs so that the target model’s isinstance check passes during EAGLE verification (when draft_inputs is passed to the target).

Parameters:

tokens (Buffer)
input_row_offsets (Buffer)
signal_buffers (list[Buffer])
host_input_row_offsets (Buffer)
batch_context_lengths (list[Buffer])
kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer] | None)
lora (LoRAInputs | None)
hidden_states (Buffer | None)
return_n_logits (Buffer)
data_parallel_splits (Buffer)
ep_inputs (tuple[Buffer, ...])
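The inheritance note above can be illustrated with a minimal sketch. The classes and the `target_execute` function below are hypothetical stand-ins, not the MAX types. They only show why subclassing lets draft inputs pass a target-side `isinstance` check.

```python
# Minimal sketch of the isinstance relationship; all names are hypothetical
# stand-ins for DeepseekV3Inputs, DeepseekV3NextNInputs, and the target model.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseInputs:  # stand-in for DeepseekV3Inputs
    tokens: list


@dataclass
class DraftInputs(BaseInputs):  # stand-in for DeepseekV3NextNInputs
    hidden_states: Optional[list] = None


def target_execute(model_inputs):
    """Hypothetical target-side entry point that validates its input type."""
    if not isinstance(model_inputs, BaseInputs):
        raise TypeError("expected BaseInputs")
    return len(model_inputs.tokens)


draft_inputs = DraftInputs(tokens=[1, 2, 3], hidden_states=[[0.0], [0.0], [0.0]])
accepted = target_execute(draft_inputs)  # passes because DraftInputs is a BaseInputs
```

Had `DraftInputs` been an unrelated class with the same fields, the check would raise `TypeError` even though the data is identical.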

`hidden_states`

hidden_states: Buffer | None = None

Hidden states for a variable number of tokens per sequence.

For data parallel models, this can be a list of Buffers where each Buffer contains hidden states for the sequences assigned to that device.
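One way to picture "a variable number of tokens per sequence" is a packed (ragged) layout, where rows for all sequences are stored back to back and an offsets array marks the boundaries. The sketch below assumes that layout and assumes `input_row_offsets` plays the boundary role. Neither is confirmed by this page, and `split_by_row_offsets` is a hypothetical helper.

```python
# Sketch of a packed (ragged) hidden-state layout; the layout itself is an
# assumption, and plain lists stand in for Buffers.


def split_by_row_offsets(rows, row_offsets):
    """Slice packed rows into one chunk per sequence.

    row_offsets has one more entry than there are sequences; sequence i
    owns rows[row_offsets[i]:row_offsets[i + 1]].
    """
    return [rows[start:end] for start, end in zip(row_offsets, row_offsets[1:])]


hidden_rows = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # five tokens in total
row_offsets = [0, 3, 5]                            # two sequences: 3 and 2 tokens

per_sequence = split_by_row_offsets(hidden_rows, row_offsets)
```

In a data-parallel setup, each device would hold its own packed rows and offsets for the sequences assigned to it.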

`DeepseekV3NextNModel`

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE, shared_weights=None, shared_ep_comm_initializer=None, max_batch_size=1)

Bases: AlwaysSignalBuffersMixin, DeepseekV2Model

Parameters:

pipeline_config (PipelineConfig)
session (InferenceSession)
devices (list[Device])
kv_cache_config (KVCacheConfig)
weights (Weights)
adapter (WeightsAdapter | None)
return_logits (ReturnLogits)
return_hidden_states (ReturnHiddenStates)
shared_weights (dict[str, DLPackArray] | None)
shared_ep_comm_initializer (EPCommInitializer | None)
max_batch_size (int)

`batch_processor_cls`

batch_processor_cls

alias of DeepseekV3NextNBatchProcessor

`execute()`

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other data required for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.

`model_config_cls`

model_config_cls

alias of DeepseekV3NextNConfig
