Python module
max.pipelines.architectures.deepseekV3_nextn
DeepSeek-V3 NextN multi-token prediction draft model for speculative decoding.
DeepseekV3NextNConfig
class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNConfig(*, dtype, kv_params, devices, use_subgraphs=True, data_parallel_degree=1, vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, moe_layer_freq=1, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=8, topk_group=4, num_experts_per_tok=8, first_k_dense_replace=3, norm_topk_prob=True, hidden_act='silu', max_position_embeddings=4096, max_seq_len=163840, rms_norm_eps=1e-06, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, rope_interleave=True, scoring_func='sigmoid', attention_bias=False, attention_dropout=0.0, norm_dtype=bfloat16, gate_dtype=None, correction_bias_dtype=None, max_batch_context_length=131072, quant_config=None, ep_config=None, graph_mode='auto', return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, eagle_aux_hidden_state_layer_ids=None)
Bases: DeepseekV3Config
Configuration for the DeepseekV3 NextN model.
The NextN (next-N token prediction) model is a single-layer decoder that takes both input embeddings and hidden states from a base model, concatenates them, and processes the result through a single decoder layer to predict the next token.
Parameters:
- dtype (DType)
- kv_params (KVCacheParamInterface)
- devices (list[DeviceRef])
- use_subgraphs (bool)
- data_parallel_degree (int)
- vocab_size (int)
- hidden_size (int)
- intermediate_size (int)
- moe_intermediate_size (int)
- moe_layer_freq (int)
- num_hidden_layers (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- n_shared_experts (int)
- n_routed_experts (int)
- routed_scaling_factor (float)
- kv_lora_rank (int)
- q_lora_rank (int)
- qk_rope_head_dim (int)
- v_head_dim (int)
- qk_nope_head_dim (int)
- topk_method (str)
- n_group (int)
- topk_group (int)
- num_experts_per_tok (int)
- first_k_dense_replace (int)
- norm_topk_prob (bool)
- hidden_act (str)
- max_position_embeddings (int)
- max_seq_len (int)
- rms_norm_eps (float)
- tie_word_embeddings (bool)
- rope_theta (float)
- rope_scaling (dict[str, Any] | None)
- rope_interleave (bool)
- scoring_func (str)
- attention_bias (bool)
- attention_dropout (float)
- norm_dtype (DType)
- gate_dtype (DType | None)
- correction_bias_dtype (DType | None)
- max_batch_context_length (int)
- quant_config (QuantConfig | None)
- ep_config (EPConfig | None)
- graph_mode (str)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
- eagle_aux_hidden_state_layer_ids (list[int] | None)
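The concatenate-and-project step described above can be sketched as follows. This is a toy illustration with reduced shapes, not the real MAX graph; the random projection stands in for the trained `eh_proj` weight:

```python
import numpy as np

# Toy shapes; the real config uses hidden_size=7168.
hidden_size, seq_len = 8, 4
rng = np.random.default_rng(0)

token_embeds = rng.standard_normal((seq_len, hidden_size))  # from embed_tokens
base_hidden = rng.standard_normal((seq_len, hidden_size))   # hidden states from the base model

# eh_proj: Linear(hidden_size * 2 -> hidden_size)
eh_proj = rng.standard_normal((2 * hidden_size, hidden_size))

# Concatenate the two streams and project back to hidden_size; the result
# then flows through the single decoder layer to predict the next token.
fused = np.concatenate([token_embeds, base_hidden], axis=-1) @ eh_proj
assert fused.shape == (seq_len, hidden_size)
```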
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Get KV cache parameters for the NextN model.
The NextN model has only a single decoder layer, so we only need to cache one layer’s worth of KV pairs.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
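As a rough illustration of why caching a single layer matters: with the MLA attention used by DeepSeek-V3, each layer caches a compressed KV latent plus a decoupled RoPE key component, so one NextN layer costs a small fraction of the 61-layer base model's cache. The per-token byte counts below are a back-of-envelope estimate, not the exact MAX allocation:

```python
# Values from the DeepseekV3NextNConfig defaults above.
kv_lora_rank, qk_rope_head_dim = 512, 64
num_base_layers = 61
bytes_per_elem = 2  # bfloat16

per_token_one_layer = (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem
per_token_base_model = per_token_one_layer * num_base_layers

print(per_token_one_layer)   # 1152 bytes per token for the single NextN layer
print(per_token_base_model)  # 70272 bytes per token for the full base model
```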
get_num_layers()
static get_num_layers(huggingface_config)
NextN only has a single decoder layer.
Parameters:
huggingface_config (AutoConfig)

Return type:
int
DeepseekV3NextNModel
class max.pipelines.architectures.deepseekV3_nextn.DeepseekV3NextNModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE, shared_weights=None, shared_ep_comm_initializer=None)
Bases: AlwaysSignalBuffersMixin, DeepseekV2Model
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
- shared_weights (dict[str, DLPackArray] | None)
- shared_ep_comm_initializer (EPCommInitializer | None)
estimate_weights_size()
classmethod estimate_weights_size(pipeline_config)
Calculates the estimated memory consumption of the DeepseekV3 NextN model.
The NextN model consists of:
- embed_tokens: VocabParallelEmbedding (shared in EAGLE/MTP mode)
- lm_head: ColumnParallelLinear (shared in EAGLE/MTP mode)
- enorm, hnorm, shared_head_norm: RMSNorm layers
- eh_proj: Linear layer (hidden_size * 2 -> hidden_size)
- decoder_layer: Single DeepseekV3DecoderLayer (MoE layer)
Parameters:
pipeline_config (PipelineConfig) – The pipeline configuration containing model settings.
Returns:
Estimated weight memory in bytes.

Return type:
int
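The component list above can be turned into a back-of-envelope estimate. The figures below use the config defaults and deliberately omit the single MoE decoder layer (which dominates the real total), so treat this as a sketch of the method, not the value MAX reports:

```python
# Config defaults from DeepseekV3NextNConfig above.
vocab_size, hidden_size = 129280, 7168
bytes_per_elem = 2  # bfloat16

param_counts = {
    "embed_tokens": vocab_size * hidden_size,  # shared in EAGLE/MTP mode
    "lm_head": vocab_size * hidden_size,       # shared in EAGLE/MTP mode
    "enorm": hidden_size,
    "hnorm": hidden_size,
    "shared_head_norm": hidden_size,
    "eh_proj": 2 * hidden_size * hidden_size,  # hidden_size * 2 -> hidden_size
}
total_bytes = sum(param_counts.values()) * bytes_per_elem
print(f"{total_bytes / 2**30:.2f} GiB")  # excludes the MoE decoder layer's weights
```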
execute()
execute(model_inputs)
Executes the graph with the given inputs.
Parameters:
model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.
Returns:
ModelOutputs containing the pipeline’s output tensors.

Return type:
ModelOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Returns the KV cache params for the pipeline model.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
load_model()
load_model(session)
Load the NextN model with the given weights.
Parameters:
session (InferenceSession)
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, hidden_states=None)
Prepare initial inputs for the NextN model.
Parameters:
- replica_batches (Sequence[Sequence[TextContext]]) – Batches of text contexts per replica
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None) – KV cache inputs (optional)
- return_n_logits (int) – Number of logits to return
- hidden_states (Buffer | None) – Hidden states from the base or draft model
Returns:
NextN model inputs.

Return type:
DeepseekV3NextNInputs
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs, hidden_states=None)
Prepare inputs for next token generation.
Parameters:
- next_tokens (Buffer) – Next tokens to process
- prev_model_inputs (ModelInputs) – Previous model inputs
- hidden_states (Buffer | None) – Hidden states from the base model (optional, will use hidden_states from prev_model_inputs if not provided)
Returns:
NextN model inputs for the next token.

Return type:
DeepseekV3NextNInputs
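The call order of the two methods can be illustrated with hypothetical stand-ins (FakeDraft and its method bodies are illustrative only, not the MAX API): prepare_initial_token_inputs seeds the draft with the base model's hidden states, and each subsequent prepare_next_token_inputs call reuses the hidden states carried in the previous inputs unless new ones are passed.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the real PipelineModel API, just to show call order.
class FakeDraft:
    def prepare_initial_token_inputs(self, batches, hidden_states=None):
        return SimpleNamespace(tokens=batches, hidden_states=hidden_states)

    def prepare_next_token_inputs(self, next_tokens, prev_inputs, hidden_states=None):
        # Falls back to the hidden states carried in prev_inputs, as documented.
        hs = hidden_states if hidden_states is not None else prev_inputs.hidden_states
        return SimpleNamespace(tokens=next_tokens, hidden_states=hs)

draft = FakeDraft()
inputs = draft.prepare_initial_token_inputs([[101, 102]], hidden_states="base_hs")
for step in range(3):  # draft a few tokens, reusing hidden states each step
    inputs = draft.prepare_next_token_inputs(f"tok{step}", inputs)
assert inputs.hidden_states == "base_hs"
```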