
Python module

max.pipelines.architectures.deepseekV3_nextn

DeepSeek-V3 NextN multi-token prediction draft model for speculative decoding.
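As background on how a draft model is typically used in speculative decoding, the sketch below shows one simple scheme, greedy verification: drafted tokens are kept until the first position where the target model disagrees. The function name `accept_draft_tokens` and the token values are hypothetical and not part of the MAX API, and the acceptance rule used by the actual pipeline may differ.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Greedy verification: keep draft tokens until the first mismatch.

    target_tokens[i] is the target model's own choice at position i, so it
    has one more entry than draft_tokens (the extra "bonus" token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First disagreement: emit the target's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the bonus token is emitted as well.
    accepted.append(target_tokens[-1])
    return accepted


# Hypothetical token ids: the third drafted token is rejected.
print(accept_draft_tokens([5, 7, 9], [5, 7, 2, 4]))
```

Each verification step emits at least one token, so decoding always makes progress even when the first drafted token is rejected.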

DeepseekV3NextNConfig

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNConfig(*, dtype, kv_params, devices, use_subgraphs=True, data_parallel_degree=1, vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, moe_layer_freq=1, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=8, topk_group=4, num_experts_per_tok=8, first_k_dense_replace=3, norm_topk_prob=True, hidden_act='silu', max_position_embeddings=4096, max_seq_len=163840, rms_norm_eps=1e-06, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, rope_interleave=True, scoring_func='sigmoid', attention_bias=False, attention_dropout=0.0, norm_dtype=bfloat16, gate_dtype=None, correction_bias_dtype=None, max_batch_context_length=131072, quant_config=None, dense_mlp_layers_without_quant=frozenset({}), ep_config=None, graph_mode='auto', return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, eagle_aux_hidden_state_layer_ids=None)

Bases: DeepseekV3Config

Configuration for the DeepseekV3 NextN model.

The NextN (next-N token prediction) model is a single-layer decoder. It takes the input embeddings and the hidden states from a base model, concatenates them, and passes the result through a single decoder layer to predict the next token.
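The concatenation step can be sketched in NumPy as follows. This is a minimal illustration rather than the MAX implementation: the helper names `rms_norm` and `nextn_fuse`, the toy sizes, the unscaled RMSNorm, and the concatenation order (embeddings first) are all assumptions.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for illustration only."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def nextn_fuse(embeddings, hidden_states, eh_proj):
    """Normalize both inputs, concatenate on the feature axis, project back."""
    fused = np.concatenate(
        [rms_norm(embeddings), rms_norm(hidden_states)], axis=-1
    )
    # [tokens, 2 * hidden] @ [2 * hidden, hidden] -> [tokens, hidden]
    return fused @ eh_proj


rng = np.random.default_rng(0)
tokens, hidden = 4, 8  # toy sizes; the default hidden_size is 7168
embeddings = rng.standard_normal((tokens, hidden))
hidden_states = rng.standard_normal((tokens, hidden))
eh_proj = rng.standard_normal((2 * hidden, hidden))

decoder_input = nextn_fuse(embeddings, hidden_states, eh_proj)
print(decoder_input.shape)
```

The projected result has the same width as a regular hidden state, which is what allows a standard decoder layer to consume it.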

Parameters:

  • dtype (DType)
  • kv_params (KVCacheParamInterface)
  • devices (list[DeviceRef])
  • use_subgraphs (bool)
  • data_parallel_degree (int)
  • vocab_size (int)
  • hidden_size (int)
  • intermediate_size (int)
  • moe_intermediate_size (int)
  • moe_layer_freq (int)
  • num_hidden_layers (int)
  • num_attention_heads (int)
  • num_key_value_heads (int)
  • n_shared_experts (int)
  • n_routed_experts (int)
  • routed_scaling_factor (float)
  • kv_lora_rank (int)
  • q_lora_rank (int)
  • qk_rope_head_dim (int)
  • v_head_dim (int)
  • qk_nope_head_dim (int)
  • topk_method (str)
  • n_group (int)
  • topk_group (int)
  • num_experts_per_tok (int)
  • first_k_dense_replace (int)
  • norm_topk_prob (bool)
  • hidden_act (str)
  • max_position_embeddings (int)
  • max_seq_len (int)
  • rms_norm_eps (float)
  • tie_word_embeddings (bool)
  • rope_theta (float)
  • rope_scaling (dict[str, Any] | None)
  • rope_interleave (bool)
  • scoring_func (str)
  • attention_bias (bool)
  • attention_dropout (float)
  • norm_dtype (DType)
  • gate_dtype (DType | None)
  • correction_bias_dtype (DType | None)
  • max_batch_context_length (int)
  • quant_config (QuantConfig | None)
  • dense_mlp_layers_without_quant (frozenset[int])
  • ep_config (EPConfig | None)
  • graph_mode (str)
  • return_logits (ReturnLogits)
  • return_hidden_states (ReturnHiddenStates)
  • eagle_aux_hidden_state_layer_ids (list[int] | None)
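To make the routing-related defaults above more concrete, here is a hedged NumPy sketch of greedy top-k expert selection with sigmoid scores. It is not the MAX routing kernel: the function `route_tokens` is hypothetical, it ignores `n_group` and `topk_group`, and applying `routed_scaling_factor` after normalization is an assumption based on the parameter names.

```python
import numpy as np


def route_tokens(router_logits, num_experts_per_tok=8,
                 norm_topk_prob=True, routed_scaling_factor=2.5):
    """Pick the top-k experts per token from sigmoid scores."""
    scores = 1.0 / (1.0 + np.exp(-router_logits))  # scoring_func='sigmoid'
    # Indices of the k highest-scoring experts for each token.
    top_idx = np.argsort(-scores, axis=-1)[:, :num_experts_per_tok]
    top_scores = np.take_along_axis(scores, top_idx, axis=-1)
    if norm_topk_prob:
        # Rescale the selected scores so they sum to 1 per token.
        top_scores = top_scores / top_scores.sum(axis=-1, keepdims=True)
    return top_idx, top_scores * routed_scaling_factor


rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 256))  # 3 tokens, n_routed_experts=256
expert_ids, expert_weights = route_tokens(logits)
print(expert_ids.shape, expert_weights.sum(axis=-1))
```

Under these assumptions, each token is sent to 8 of the 256 routed experts and its combined routing weight equals `routed_scaling_factor`.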

construct_kv_params()

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Get KV cache parameters for the NextN model.

The NextN model has only a single decoder layer, so we only need to cache one layer’s worth of KV pairs.

Parameters:

  • huggingface_config
  • pipeline_config
  • devices
  • kv_cache_config
  • cache_dtype

Return type:

KVCacheParams
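The effect of caching only one layer can be estimated with simple arithmetic. The sketch below assumes a latent (MLA-style) cache that stores `kv_lora_rank + qk_rope_head_dim` values per token per layer in a 2-byte dtype. That layout and the helper `mla_cache_bytes_per_token` are assumptions for illustration and may not match how MAX actually lays out the cache.

```python
def mla_cache_bytes_per_token(num_layers, kv_lora_rank=512,
                              qk_rope_head_dim=64, bytes_per_element=2):
    """Bytes of latent KV cache needed per token across num_layers layers."""
    per_layer = (kv_lora_rank + qk_rope_head_dim) * bytes_per_element
    return num_layers * per_layer


draft_bytes = mla_cache_bytes_per_token(num_layers=1)    # NextN draft model
target_bytes = mla_cache_bytes_per_token(num_layers=61)  # num_hidden_layers default
print(draft_bytes, target_bytes)
```

Under these assumptions the draft model's per-token cache footprint is 1/61 of the full model's.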

get_num_layers()

static get_num_layers(huggingface_config)

NextN only has a single decoder layer.

Parameters:

huggingface_config (AutoConfig)

Return type:

int

DeepseekV3NextNInputs

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNInputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())

Bases: DeepseekV3Inputs

A class representing inputs for the DeepseekV3 NextN model.

Inherits from DeepseekV3Inputs so that the target model’s isinstance check passes during EAGLE verification (when draft_inputs is passed to the target).
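The inheritance rationale can be illustrated with plain Python stand-ins. `BaseInputs`, `DraftInputs`, and `target_execute` below are hypothetical and only mimic the relationship between the real classes. The point is that an instance of a subclass passes an `isinstance` check against its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseInputs:  # hypothetical stand-in for DeepseekV3Inputs
    tokens: list


@dataclass
class DraftInputs(BaseInputs):  # hypothetical stand-in for DeepseekV3NextNInputs
    hidden_states: Optional[list] = None


def target_execute(model_inputs):
    """Mimics a target model that only accepts its own input type."""
    if not isinstance(model_inputs, BaseInputs):
        raise TypeError("unexpected input type")
    return len(model_inputs.tokens)


draft_inputs = DraftInputs(tokens=[1, 2, 3], hidden_states=[0.1, 0.2, 0.3])
print(target_execute(draft_inputs))
```

Had `DraftInputs` been an unrelated class with the same fields, the check in `target_execute` would raise `TypeError`.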

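Parameters:

  • tokens
  • input_row_offsets
  • signal_buffers
  • host_input_row_offsets
  • batch_context_lengths
  • kv_cache_inputs
  • lora_ids
  • lora_ranks
  • hidden_states
  • return_n_logits
  • data_parallel_splits
  • ep_inputs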

hidden_states

hidden_states: Buffer | None = None

Hidden states for a variable number of tokens per sequence.

For data parallel models, this can be a list of Buffers where each Buffer contains hidden states for the sequences assigned to that device.
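A variable number of tokens per sequence is commonly handled with a flat, packed layout plus cumulative row offsets. The sketch below assumes that `hidden_states` is packed this way and indexed like `input_row_offsets`. Both that assumption and the helper `split_by_row_offsets` are illustrative and not taken from the MAX source.

```python
def split_by_row_offsets(flat_rows, row_offsets):
    """Slice a flat list of per-token rows into one chunk per sequence."""
    return [
        flat_rows[start:end]
        for start, end in zip(row_offsets, row_offsets[1:])
    ]


# Five token rows in total: sequence 0 has 2 tokens, sequence 1 has 3.
flat_hidden_states = [
    [0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1],
]
row_offsets = [0, 2, 5]
per_sequence = split_by_row_offsets(flat_hidden_states, row_offsets)
print([len(chunk) for chunk in per_sequence])
```

With this layout no padding is needed: the offsets alone record where each sequence starts and ends.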

DeepseekV3NextNModel

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE, shared_weights=None, shared_ep_comm_initializer=None)

Bases: AlwaysSignalBuffersMixin, DeepseekV2Model

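Parameters:

  • pipeline_config
  • session
  • devices
  • kv_cache_config
  • weights
  • adapter
  • return_logits
  • return_hidden_states
  • shared_weights
  • shared_ep_comm_initializer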

estimate_weights_size()

classmethod estimate_weights_size(pipeline_config)

Calculates the estimated memory consumption of the DeepseekV3 NextN model.

The NextN model consists of:

  • embed_tokens: VocabParallelEmbedding (shared in EAGLE/MTP mode)
  • lm_head: ColumnParallelLinear (shared in EAGLE/MTP mode)
  • enorm, hnorm, shared_head_norm: RMSNorm layers
  • eh_proj: Linear layer (hidden_size * 2 -> hidden_size)
  • decoder_layer: Single DeepseekV3DecoderLayer (MoE layer)

Parameters:

pipeline_config (PipelineConfig) – The pipeline configuration containing model settings.

Returns:

Estimated weight memory in bytes.

Return type:

int
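The component list above can be turned into a rough back-of-the-envelope estimate. The sketch below covers only the parts outside the decoder layer and assumes 2 bytes per weight, no bias terms, and one `hidden_size` vector per RMSNorm. The function `estimate_light_weights_bytes` is hypothetical and does not reproduce the logic of `estimate_weights_size()`.

```python
def estimate_light_weights_bytes(vocab_size=129280, hidden_size=7168,
                                 bytes_per_element=2, shared_embeddings=True):
    """Rough byte count for the NextN components outside the decoder layer."""
    norms = 3 * hidden_size                    # enorm, hnorm, shared_head_norm
    eh_proj = (2 * hidden_size) * hidden_size  # hidden_size * 2 -> hidden_size
    params = norms + eh_proj
    if not shared_embeddings:
        # embed_tokens and lm_head each hold vocab_size * hidden_size weights.
        params += 2 * vocab_size * hidden_size
    return params * bytes_per_element


shared = estimate_light_weights_bytes(shared_embeddings=True)
standalone = estimate_light_weights_bytes(shared_embeddings=False)
print(shared, standalone)
```

Under these assumptions, sharing `embed_tokens` and `lm_head` with the target model removes by far the largest non-decoder term from the draft model's footprint.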

execute()

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.

load_model()

load_model(session)

Load the NextN model with the given weights.

Parameters:

session (InferenceSession)

Return type:

Model

model_config_cls

model_config_cls

alias of DeepseekV3NextNConfig

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, hidden_states=None)

Prepare initial inputs for the NextN model.

Parameters:

  • replica_batches (Sequence[Sequence[TextContext]]) – Batches of text contexts per replica
  • kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None) – KV cache inputs (optional)
  • return_n_logits (int) – Number of logits to return
  • hidden_states (Buffer | None) – Hidden states from the base or draft model

Returns:

NextN model inputs

Return type:

DeepseekV3NextNInputs
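As a loose illustration of what preparing inputs from per-replica batches can involve, the sketch below packs nested token lists into a flat token list, cumulative row offsets, and per-replica split points. It uses plain lists of token ids in place of `TextContext` objects. The function `flatten_replica_batches` and the exact meaning given to `replica_splits` are assumptions, not the behavior of `prepare_initial_token_inputs()`.

```python
def flatten_replica_batches(replica_batches):
    """Pack per-replica token lists into flat tokens plus cumulative offsets."""
    tokens, row_offsets, replica_splits = [], [0], [0]
    for batch in replica_batches:
        for sequence_tokens in batch:
            tokens.extend(sequence_tokens)
            # Offset where the next sequence starts in the flat token list.
            row_offsets.append(len(tokens))
        # Cumulative number of sequences seen after this replica.
        replica_splits.append(len(row_offsets) - 1)
    return tokens, row_offsets, replica_splits


# Two replicas: the first holds two sequences, the second holds one.
replica_batches = [[[11, 12], [21]], [[31, 32, 33]]]
tokens, row_offsets, replica_splits = flatten_replica_batches(replica_batches)
print(tokens, row_offsets, replica_splits)
```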