Python module
max.pipelines.architectures.deepseekV3
DeepSeek-V3 mixture-of-experts architecture for text generation.
DeepseekV3Config
class max.pipelines.architectures.deepseekV3.DeepseekV3Config(*, dtype, kv_params, devices, use_subgraphs=True, data_parallel_degree=1, vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, moe_layer_freq=1, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=8, topk_group=4, num_experts_per_tok=8, first_k_dense_replace=3, norm_topk_prob=True, hidden_act='silu', max_position_embeddings=4096, max_seq_len=163840, rms_norm_eps=1e-06, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, rope_interleave=True, scoring_func='sigmoid', attention_bias=False, attention_dropout=0.0, norm_dtype=bfloat16, gate_dtype=None, correction_bias_dtype=None, max_batch_context_length=131072, quant_config=None, ep_config=None, graph_mode='auto', return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, eagle_aux_hidden_state_layer_ids=None)
Bases: ArchConfigWithKVCache
Configuration for DeepseekV3 models.
-
Parameters:
-
- dtype (DType)
- kv_params (KVCacheParamInterface)
- devices (list[DeviceRef])
- use_subgraphs (bool)
- data_parallel_degree (int)
- vocab_size (int)
- hidden_size (int)
- intermediate_size (int)
- moe_intermediate_size (int)
- moe_layer_freq (int)
- num_hidden_layers (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- n_shared_experts (int)
- n_routed_experts (int)
- routed_scaling_factor (float)
- kv_lora_rank (int)
- q_lora_rank (int)
- qk_rope_head_dim (int)
- v_head_dim (int)
- qk_nope_head_dim (int)
- topk_method (str)
- n_group (int)
- topk_group (int)
- num_experts_per_tok (int)
- first_k_dense_replace (int)
- norm_topk_prob (bool)
- hidden_act (str)
- max_position_embeddings (int)
- max_seq_len (int)
- rms_norm_eps (float)
- tie_word_embeddings (bool)
- rope_theta (float)
- rope_scaling (dict[str, Any] | None)
- rope_interleave (bool)
- scoring_func (str)
- attention_bias (bool)
- attention_dropout (float)
- norm_dtype (DType)
- gate_dtype (DType | None)
- correction_bias_dtype (DType | None)
- max_batch_context_length (int)
- quant_config (QuantConfig | None)
- ep_config (EPConfig | None)
- graph_mode (str)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
- eagle_aux_hidden_state_layer_ids (list[int] | None)
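The routing fields above (n_routed_experts, n_group, topk_group, num_experts_per_tok, scoring_func, norm_topk_prob, routed_scaling_factor) describe group-limited top-k expert routing. The sketch below illustrates that scheme with the default values; the group-scoring rule here (best expert per group) is one common variant and is not taken from the MAX kernels, so treat it as an assumption.

```python
import math
import random

def route(logits, n_group=8, topk_group=4, top_k=8, scale=2.5):
    """Group-limited top-k routing sketch: sigmoid scoring, keep the best
    topk_group groups, then pick top_k experts among the survivors."""
    n = len(logits)
    scores = [1 / (1 + math.exp(-x)) for x in logits]  # scoring_func='sigmoid'
    group_size = n // n_group
    # Score each group by its best expert and keep the topk_group groups.
    group_scores = [max(scores[g * group_size:(g + 1) * group_size])
                    for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: group_scores[g],
                  reverse=True)[:topk_group]
    candidates = [i for g in kept
                  for i in range(g * group_size, (g + 1) * group_size)]
    # Top-k experts among the surviving groups.
    chosen = sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]
    # norm_topk_prob=True normalizes the kept weights; routed_scaling_factor
    # then rescales them.
    total = sum(scores[i] for i in chosen)
    return {i: scale * scores[i] / total for i in chosen}

random.seed(0)
weights = route([random.gauss(0, 1) for _ in range(256)])
```

With the defaults, each token is routed to 8 of 256 experts drawn from at most 4 of the 8 expert groups, and the routing weights sum to routed_scaling_factor.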
attention_bias
attention_bias: bool = False
attention_dropout
attention_dropout: float = 0.0
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
-
Parameters:
-
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
correction_bias_dtype
correction_bias_dtype: DType | None = None
data_parallel_degree
data_parallel_degree: int = 1
devices
devices: list[DeviceRef]
dtype
dtype: DType
eagle_aux_hidden_state_layer_ids
eagle_aux_hidden_state_layer_ids: list[int] | None = None
Optional explicit hidden-state capture layer ids for EAGLE3.
ep_config
ep_config: EPConfig | None = None
first_k_dense_replace
first_k_dense_replace: int = 3
gate_dtype
gate_dtype: DType | None = None
get_kv_params()
get_kv_params()
KV cache parameters to use when running the model.
-
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the default maximum sequence length for the model.
Subclasses should determine whether this value can be overridden by
setting the --max-length (pipeline_config.model.max_length) flag.
-
Return type:
get_num_layers()
static get_num_layers(huggingface_config)
-
Parameters:
-
huggingface_config (AutoConfig)
-
Return type:
graph_mode
graph_mode: str = 'auto'
hidden_act
hidden_act: str = 'silu'
hidden_size
hidden_size: int = 7168
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a DeepseekV3Config instance from pipeline configuration.
This method creates a config instance with all fields that can be determined from the pipeline configuration, without needing the state_dict. Fields that depend on the state_dict (like norm_dtype, quant_config, etc.) should be set via the finalize() method.
-
Parameters:
-
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
-
Returns:
-
An initialized DeepseekV3Config instance.
-
Return type:
intermediate_size
intermediate_size: int = 18432
kv_lora_rank
kv_lora_rank: int = 512
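The low kv_lora_rank is what makes multi-head latent attention (MLA) cheap to cache: only the compressed latent plus the RoPE key part is stored per token, rather than full per-head K and V. A back-of-envelope comparison with the defaults above (this is intuition only, not MAX's allocator logic):

```python
# Per-token, per-layer KV-cache entries under MLA vs. uncompressed MHA,
# using the config defaults in this section.
kv_lora_rank, qk_rope_head_dim = 512, 64
num_kv_heads, qk_nope_head_dim, v_head_dim = 128, 128, 128

# MLA caches the compressed KV latent plus the shared RoPE key component.
mla_entries = kv_lora_rank + qk_rope_head_dim  # 576

# Uncompressed MHA would cache full K (nope + rope parts) and V per head.
mha_entries = num_kv_heads * ((qk_nope_head_dim + qk_rope_head_dim) + v_head_dim)

ratio = mha_entries / mla_entries  # roughly 71x smaller cache per token
```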
kv_params
kv_params: KVCacheParamInterface
max_batch_context_length
max_batch_context_length: int = 131072
max_position_embeddings
max_position_embeddings: int = 4096
Maximum positional embeddings as defined by the original model.
max_seq_len
max_seq_len: int = 163840
Maximum sequence length as defined by the MAX Engine pipeline configuration.
moe_intermediate_size
moe_intermediate_size: int = 2048
moe_layer_freq
moe_layer_freq: int = 1
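first_k_dense_replace and moe_layer_freq together determine which decoder layers use MoE. Assuming the common DeepSeek rule (layer i is MoE when i >= first_k_dense_replace and i is a multiple of moe_layer_freq), the defaults give:

```python
# MoE layer schedule with the defaults above: the first 3 layers keep a
# dense FFN, every remaining layer (moe_layer_freq=1) is MoE.
num_hidden_layers, first_k_dense_replace, moe_layer_freq = 61, 3, 1
moe_layers = [i for i in range(num_hidden_layers)
              if i >= first_k_dense_replace and i % moe_layer_freq == 0]
# Layers 0-2 stay dense; layers 3-60 (58 layers) are MoE.
```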
n_group
n_group: int = 8
n_routed_experts
n_routed_experts: int = 256
n_shared_experts
n_shared_experts: int = 1
norm_dtype
norm_dtype: DType = bfloat16
norm_topk_prob
norm_topk_prob: bool = True
num_attention_heads
num_attention_heads: int = 128
num_experts_per_tok
num_experts_per_tok: int = 8
num_hidden_layers
num_hidden_layers: int = 61
num_key_value_heads
num_key_value_heads: int = 128
q_lora_rank
q_lora_rank: int = 1536
qk_nope_head_dim
qk_nope_head_dim: int = 128
qk_rope_head_dim
qk_rope_head_dim: int = 64
quant_config
quant_config: QuantConfig | None = None
return_hidden_states
return_hidden_states: ReturnHiddenStates = 'none'
Whether to return hidden states and which type (none, last, all, last_normalized, all_normalized).
return_logits
return_logits: ReturnLogits = 'last_token'
Whether to return the last token, all logits, or a variable number of logits.
rms_norm_eps
rms_norm_eps: float = 1e-06
rope_interleave
rope_interleave: bool = True
rope_scaling
rope_scaling: dict[str, Any] | None = None
rope_theta
rope_theta: float = 10000.0
routed_scaling_factor
routed_scaling_factor: float = 2.5
scoring_func
scoring_func: str = 'sigmoid'
tie_word_embeddings
tie_word_embeddings: bool = False
topk_group
topk_group: int = 4
topk_method
topk_method: str = 'greedy'
use_subgraphs
use_subgraphs: bool = True
v_head_dim
v_head_dim: int = 128
vocab_size
vocab_size: int = 129280
DeepseekV3Inputs
class max.pipelines.architectures.deepseekV3.DeepseekV3Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())
Bases: DeepseekV2Inputs
A class representing inputs for the DeepseekV3 model.
-
Parameters:
-
- tokens (Buffer)
- input_row_offsets (Buffer)
- signal_buffers (list[Buffer])
- host_input_row_offsets (Buffer)
- batch_context_lengths (list[Buffer])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
- return_n_logits (Buffer)
- data_parallel_splits (Buffer)
- ep_inputs (tuple[Buffer, ...])
batch_context_lengths
batch_context_lengths: list[Buffer]
List of tensors containing the context length of each batch.
buffers
Returns positional Buffer inputs for model ABI calls.
data_parallel_splits
data_parallel_splits: Buffer
Tensor containing the data parallel splits for the MLA layer.
ep_inputs
ep_inputs: tuple[Buffer, ...]
Expert-parallel communication buffers (atomic counters and device pointers).
host_input_row_offsets
host_input_row_offsets: Buffer
Tensor containing the host input row offsets.
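The tokens/input_row_offsets pair is a ragged-batch layout: variable-length sequences are flattened into one token buffer, and a prefix-sum of lengths marks where each sequence starts. A small sketch of this layout (assumption: the standard prefix-sum convention, not taken from the MAX source):

```python
# Flatten a ragged batch into `tokens` plus `input_row_offsets`.
batch = [[101, 7, 9], [42], [5, 6]]  # three sequences of different lengths

tokens = [t for seq in batch for t in seq]
offsets = [0]
for seq in batch:
    offsets.append(offsets[-1] + len(seq))

# Sequence i occupies tokens[offsets[i]:offsets[i + 1]].
```

This avoids padding: the total buffer size is the sum of sequence lengths, and offsets has batch_size + 1 entries.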
DeepseekV3Model
class max.pipelines.architectures.deepseekV3.DeepseekV3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE)
Bases: AlwaysSignalBuffersMixin, DeepseekV2Model
A DeepseekV3 model.
-
Parameters:
-
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
-
Parameters:
-
- pipeline_config (PipelineConfig) – Pipeline configuration
- huggingface_config (AutoConfig) – HuggingFace model configuration
-
Returns:
-
Estimated activation memory in bytes
-
Return type:
estimate_weights_size()
classmethod estimate_weights_size(pipeline_config)
Calculates the estimated memory consumption of the model's weights.
-
Parameters:
-
pipeline_config (PipelineConfig)
-
Return type:
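For intuition about what estimate_weights_size() must account for, a back-of-envelope parameter count from the config defaults in this page is sketched below. This is hand-rolled arithmetic for illustration only, not the method's actual logic (which works from the checkpoint metadata); it lands near DeepSeek-V3's published 671B total.

```python
# Rough DeepSeek-V3 parameter count from the config defaults above.
hidden, vocab, n_layers = 7168, 129280, 61
n_dense = 3                      # first_k_dense_replace
n_moe = n_layers - n_dense       # 58 MoE layers
inter, moe_inter = 18432, 2048   # dense / per-expert FFN widths
n_experts, n_shared = 256, 1
heads = 128
q_lora, kv_lora = 1536, 512
qk_rope, qk_nope, v_dim = 64, 128, 128

# Embeddings plus untied LM head (tie_word_embeddings=False).
embed = 2 * vocab * hidden

def ffn(h, i):
    # Gated (SwiGLU-style) FFN: gate, up, and down projections.
    return 3 * h * i

dense_ffn = n_dense * ffn(hidden, inter)
moe_ffn = n_moe * (n_experts + n_shared) * ffn(hidden, moe_inter)

# MLA attention: low-rank Q/KV down-projections, up-projections, output proj.
attn = n_layers * (
    hidden * q_lora
    + q_lora * heads * (qk_rope + qk_nope)
    + hidden * (kv_lora + qk_rope)
    + kv_lora * heads * (qk_nope + v_dim)
    + heads * v_dim * hidden
)

total = embed + dense_ffn + moe_ffn + attn
print(f"~{total / 1e9:.0f}B parameters")  # the MoE FFNs dominate the total
```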
execute()
execute(model_inputs)
Executes the graph with the given inputs.
This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
-
Parameters:
-
model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.
-
Returns:
-
ModelOutputs containing the pipeline’s output tensors.
-
Return type:
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Returns the KV cache params for the pipeline model.
-
Parameters:
-
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
load_model()
load_model(session)
Load the model with the given weights.
-
Parameters:
-
session (InferenceSession)
-
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs to be passed to execute().
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and kv_cache_inputs (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
-
Parameters:
-
- replica_batches (Sequence[Sequence[TextContext]])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- return_n_logits (int)
-
Return type:
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the secondary inputs to be passed to execute().
While prepare_initial_token_inputs manages the initial inputs, this function updates them for each step of a multi-step execution pattern.
-
Parameters:
-
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
-
Return type:
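The two-phase input preparation above follows a common pattern: build inputs once for the prompt, then cheaply refresh them with each step's new tokens. The toy sketch below shows only the control-flow shape; the classes and token logic are stand-ins, not the real max.pipelines API.

```python
class ToyModel:
    """Stand-in illustrating the initial/next input-preparation pattern."""

    def prepare_initial_token_inputs(self, batch):
        # Build the full input structure once for the prompt batch.
        return {"tokens": batch, "step": 0}

    def prepare_next_token_inputs(self, next_tokens, prev_inputs):
        # Cheap per-step update: swap in new tokens, bump the step counter.
        return {"tokens": next_tokens, "step": prev_inputs["step"] + 1}

    def execute(self, inputs):
        # Pretend decoding: emit one new token id per sequence per step.
        if inputs["step"] == 0:
            return [seq[-1] + 1 for seq in inputs["tokens"]]
        return [tok + 1 for tok in inputs["tokens"]]

model = ToyModel()
inputs = model.prepare_initial_token_inputs([[1, 2], [7]])
for _ in range(3):  # multi-step execution: execute, then refresh inputs
    next_tokens = model.execute(inputs)
    inputs = model.prepare_next_token_inputs(next_tokens, inputs)
```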