Python module

max.pipelines.architectures.deepseekV3_nextn

DeepSeek-V3 NextN multi-token prediction draft model for speculative decoding.

DeepseekV3NextNConfig

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNConfig(*, dtype, kv_params, devices, use_subgraphs=True, data_parallel_degree=1, vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, moe_layer_freq=1, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=8, topk_group=4, num_experts_per_tok=8, first_k_dense_replace=3, norm_topk_prob=True, hidden_act='silu', max_position_embeddings=4096, max_seq_len=163840, rms_norm_eps=1e-06, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, rope_interleave=True, scoring_func='sigmoid', attention_bias=False, attention_dropout=0.0, norm_dtype=bfloat16, gate_dtype=None, correction_bias_dtype=None, max_batch_context_length=131072, quant_config=None, ep_config=None, graph_mode='auto', return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, eagle_aux_hidden_state_layer_ids=None)

Bases: DeepseekV3Config

Configuration for DeepseekV3 NextN model.

The NextN (Next-N token prediction) model is a single-layer decoder that takes both input embeddings and hidden states from a base model as input, concatenates them, and processes them through a single decoder layer to predict the next token.
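The dataflow described above can be sketched in plain Python. This is an illustrative toy, not the MAX implementation: the layer names (`enorm`, `hnorm`, `eh_proj`) follow the component list in `estimate_weights_size` below, but the arithmetic here is deliberately simplified (RMSNorm without a learned scale, an identity-like projection, and a pass-through decoder layer).

```python
HIDDEN_SIZE = 4  # toy value; the real config uses hidden_size=7168


def rms_norm(x, eps=1e-6):
    """Simplified RMSNorm (no learned scale)."""
    ms = sum(v * v for v in x) / len(x)
    return [v / (ms + eps) ** 0.5 for v in x]


def eh_proj(concat, weight):
    """Linear map from 2*hidden_size down to hidden_size."""
    return [sum(w * c for w, c in zip(row, concat)) for row in weight]


def nextn_step(embedding, hidden_state, weight, decoder_layer):
    # Normalize each input (enorm/hnorm), concatenate along the feature
    # axis, project 2H -> H, then run the single decoder layer.
    concat = rms_norm(embedding) + rms_norm(hidden_state)
    projected = eh_proj(concat, weight)
    return decoder_layer(projected)


# Toy usage: identity-like projection, pass-through decoder layer.
weight = [[1.0 if j == i else 0.0 for j in range(2 * HIDDEN_SIZE)]
          for i in range(HIDDEN_SIZE)]
out = nextn_step([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0],
                 weight, decoder_layer=lambda x: x)
```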

Parameters:

  • dtype (DType)
  • kv_params (KVCacheParamInterface)
  • devices (list[DeviceRef])
  • use_subgraphs (bool)
  • data_parallel_degree (int)
  • vocab_size (int)
  • hidden_size (int)
  • intermediate_size (int)
  • moe_intermediate_size (int)
  • moe_layer_freq (int)
  • num_hidden_layers (int)
  • num_attention_heads (int)
  • num_key_value_heads (int)
  • n_shared_experts (int)
  • n_routed_experts (int)
  • routed_scaling_factor (float)
  • kv_lora_rank (int)
  • q_lora_rank (int)
  • qk_rope_head_dim (int)
  • v_head_dim (int)
  • qk_nope_head_dim (int)
  • topk_method (str)
  • n_group (int)
  • topk_group (int)
  • num_experts_per_tok (int)
  • first_k_dense_replace (int)
  • norm_topk_prob (bool)
  • hidden_act (str)
  • max_position_embeddings (int)
  • max_seq_len (int)
  • rms_norm_eps (float)
  • tie_word_embeddings (bool)
  • rope_theta (float)
  • rope_scaling (dict[str, Any] | None)
  • rope_interleave (bool)
  • scoring_func (str)
  • attention_bias (bool)
  • attention_dropout (float)
  • norm_dtype (DType)
  • gate_dtype (DType | None)
  • correction_bias_dtype (DType | None)
  • max_batch_context_length (int)
  • quant_config (QuantConfig | None)
  • ep_config (EPConfig | None)
  • graph_mode (str)
  • return_logits (ReturnLogits)
  • return_hidden_states (ReturnHiddenStates)
  • eagle_aux_hidden_state_layer_ids (list[int] | None)

construct_kv_params()

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

source

Get KV cache parameters for the NextN model.

The NextN model has only a single decoder layer, so we only need to cache one layer’s worth of KV pairs.
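Because the draft model caches a single layer, its KV footprint is a small fraction of the base model's. A back-of-envelope sketch, assuming an MLA-style compressed cache that stores `kv_lora_rank + qk_rope_head_dim` values per token per layer (consistent with the config defaults above, but an assumption about the cache layout):

```python
kv_lora_rank = 512      # config default
qk_rope_head_dim = 64   # config default
dtype_bytes = 2         # bfloat16


def kv_bytes_per_token(num_layers: int) -> int:
    """Bytes of KV cache per token, under the MLA-layout assumption."""
    return num_layers * (kv_lora_rank + qk_rope_head_dim) * dtype_bytes


full_model = kv_bytes_per_token(61)  # base model: num_hidden_layers=61
nextn = kv_bytes_per_token(1)        # NextN draft: one decoder layer
```

The draft model's per-token cache is thus 1/61 of the base model's.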

Parameters:

  • huggingface_config
  • pipeline_config
  • devices
  • kv_cache_config
  • cache_dtype

Return type:

KVCacheParams

get_num_layers()

static get_num_layers(huggingface_config)

NextN only has a single decoder layer.

Parameters:

huggingface_config (AutoConfig)

Return type:

int

DeepseekV3NextNModel

class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE, shared_weights=None, shared_ep_comm_initializer=None)

Bases: AlwaysSignalBuffersMixin, DeepseekV2Model

Parameters:

  • pipeline_config
  • session
  • devices
  • kv_cache_config
  • weights
  • adapter
  • return_logits
  • return_hidden_states
  • shared_weights
  • shared_ep_comm_initializer

estimate_weights_size()

classmethod estimate_weights_size(pipeline_config)

Calculates the estimated memory consumption of the DeepseekV3 NextN model.

The NextN model consists of:

  • embed_tokens: VocabParallelEmbedding (shared in EAGLE/MTP mode)
  • lm_head: ColumnParallelLinear (shared in EAGLE/MTP mode)
  • enorm, hnorm, shared_head_norm: RMSNorm layers
  • eh_proj: Linear layer (hidden_size * 2 -> hidden_size)
  • decoder_layer: Single DeepseekV3DecoderLayer (MoE layer)
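The component list above translates directly into a parameter-count estimate. This is a hedged sketch, not the actual `estimate_weights_size` implementation: the decoder-layer size is left as an input (the real MoE layer dominates and its breakdown is not shown here), and the shared embed_tokens/lm_head weights are excluded when shared in EAGLE/MTP mode.

```python
vocab_size, hidden_size = 129_280, 7_168  # config defaults
dtype_bytes = 2                           # bfloat16


def estimate_nextn_weight_bytes(share_embeddings: bool,
                                decoder_layer_params: int) -> int:
    """Rough parameter-count estimate for the NextN components."""
    params = 0
    if not share_embeddings:
        # embed_tokens + lm_head, each vocab_size x hidden_size
        params += 2 * vocab_size * hidden_size
    params += 3 * hidden_size                # enorm, hnorm, shared_head_norm
    params += 2 * hidden_size * hidden_size  # eh_proj: 2H -> H
    params += decoder_layer_params           # single MoE decoder layer
    return params * dtype_bytes
```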

Parameters:

pipeline_config (PipelineConfig) – The pipeline configuration containing model settings.

Returns:

Estimated weight memory in bytes.

Return type:

int

execute()

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.

get_kv_params()

classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

Returns the KV cache params for the pipeline model.

Parameters:

  • huggingface_config
  • pipeline_config
  • devices
  • kv_cache_config
  • cache_dtype

Return type:

KVCacheParams

load_model()

load_model(session)

Load the NextN model with the given weights.

Parameters:

session (InferenceSession)

Return type:

Model

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, hidden_states=None)

Prepare initial inputs for the NextN model.

Parameters:

  • replica_batches (Sequence[Sequence[TextContext]]) – Batches of text contexts per replica
  • kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None) – KV cache inputs (optional)
  • return_n_logits (int) – Number of logits to return
  • hidden_states (Buffer | None) – Hidden states from the base or draft model

Returns:

NextN model inputs

Return type:

DeepseekV3NextNInputs

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs, hidden_states=None)

Prepare inputs for next token generation.

Parameters:

  • next_tokens (Buffer) – Next tokens to process
  • prev_model_inputs (ModelInputs) – Previous model inputs
  • hidden_states (Buffer | None) – Hidden states from the base model (optional, will use hidden_states from prev_model_inputs if not provided)

Returns:

NextN model inputs for next token

Return type:

DeepseekV3NextNInputs
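The two prepare methods are typically chained by a speculative-decoding driver: initial inputs are built once, then each subsequent draft step reuses the previous inputs. The sketch below illustrates that calling pattern with a stand-in class, not the MAX model; in particular, the fallback to the previous hidden states when `hidden_states` is omitted mirrors the documented behavior of `prepare_next_token_inputs`.

```python
class StubDraftModel:
    """Stand-in illustrating the prepare_* calling pattern."""

    def prepare_initial_token_inputs(self, replica_batches,
                                     hidden_states=None):
        return {"tokens": replica_batches, "hidden_states": hidden_states}

    def prepare_next_token_inputs(self, next_tokens, prev_model_inputs,
                                  hidden_states=None):
        # Fall back to the previous inputs' hidden states when none are
        # provided, as documented above.
        hs = (hidden_states if hidden_states is not None
              else prev_model_inputs["hidden_states"])
        return {"tokens": next_tokens, "hidden_states": hs}


draft = StubDraftModel()
inputs = draft.prepare_initial_token_inputs([[101]], hidden_states="h0")
for step in range(3):
    inputs = draft.prepare_next_token_inputs([step], inputs)
```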