Python module
max.pipelines.architectures.deepseekV2
DeepSeek-V2 mixture-of-experts architecture for text generation.
DeepseekV2Config
class max.pipelines.architectures.deepseekV2.DeepseekV2Config(*, dtype, kv_params, devices, vocab_size=102400, hidden_size=4096, intermediate_size=11008, moe_intermediate_size=1407, num_hidden_layers=30, num_attention_heads=32, num_key_value_heads=32, n_shared_experts=0, n_routed_experts=0, ep_size=1, routed_scaling_factor=1.0, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=0, topk_group=0, num_experts_per_tok=0, moe_layer_freq=1, first_k_dense_replace=0, norm_topk_prob=False, scoring_func='softmax', aux_loss_alpha=0.001, seq_aux=True, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, pad_token_id=None, bos_token_id=100000, eos_token_id=100001, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, max_batch_context_length=131072, graph_mode='auto')
Bases: ArchConfigWithKVCache
Configuration for DeepseekV2 models.
Parameters:
- dtype (DType)
- kv_params (KVCacheParams)
- devices (list[DeviceRef])
- vocab_size (int)
- hidden_size (int)
- intermediate_size (int)
- moe_intermediate_size (int)
- num_hidden_layers (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- n_shared_experts (int)
- n_routed_experts (int)
- ep_size (int)
- routed_scaling_factor (float)
- kv_lora_rank (int)
- q_lora_rank (int | None)
- qk_rope_head_dim (int)
- v_head_dim (int)
- qk_nope_head_dim (int)
- topk_method (str)
- n_group (int)
- topk_group (int)
- num_experts_per_tok (int)
- moe_layer_freq (int)
- first_k_dense_replace (int)
- norm_topk_prob (bool)
- scoring_func (str)
- aux_loss_alpha (float)
- seq_aux (bool)
- hidden_act (str)
- max_position_embeddings (int)
- initializer_range (float)
- rms_norm_eps (float)
- use_cache (bool)
- pad_token_id (int | None)
- bos_token_id (int)
- eos_token_id (int)
- pretraining_tp (int)
- tie_word_embeddings (bool)
- rope_theta (float)
- rope_scaling (dict[str, Any] | None)
- attention_bias (bool)
- attention_dropout (float)
- max_batch_context_length (int)
- graph_mode (str)
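Several of these fields jointly control where mixture-of-experts layers appear. A minimal sketch, assuming the standard DeepSeek-V2 convention (not an API in this module): layers with index below first_k_dense_replace stay dense, and beyond that every moe_layer_freq-th layer uses the routed experts.

```python
# Sketch of MoE layer placement under the usual DeepSeek-V2 convention
# (an assumption about these config fields, not code from this module).
def moe_layer_indices(
    num_hidden_layers: int,
    n_routed_experts: int,
    first_k_dense_replace: int,
    moe_layer_freq: int,
) -> list[int]:
    if n_routed_experts <= 0:
        return []  # no routed experts configured: every layer stays dense
    return [
        i
        for i in range(num_hidden_layers)
        if i >= first_k_dense_replace and i % moe_layer_freq == 0
    ]

# Example: 30 layers, first layer dense, MoE on every layer after that.
print(moe_layer_indices(30, 64, 1, 1))
```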
attention_bias
attention_bias: bool = False
attention_dropout
attention_dropout: float = 0.0
aux_loss_alpha
aux_loss_alpha: float = 0.001
bos_token_id
bos_token_id: int = 100000
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
devices
devices: list[DeviceRef]
dtype
dtype: DType
eos_token_id
eos_token_id: int = 100001
ep_size
ep_size: int = 1
first_k_dense_replace
first_k_dense_replace: int = 0
get_kv_params()
get_kv_params()
KV cache parameters to use when running the model.
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the default maximum sequence length for the model.
Subclasses should determine whether this value can be overridden by
setting the --max-length (pipeline_config.model.max_length) flag.
Return type:
get_num_layers()
static get_num_layers(huggingface_config)
Parameters:
- huggingface_config (AutoConfig)
Return type:
graph_mode
graph_mode: str = 'auto'
hidden_act
hidden_act: str = 'silu'
hidden_size
hidden_size: int = 4096
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initialize the config from a PipelineConfig.
Parameters:
- pipeline_config (PipelineConfig) – The pipeline configuration.
- model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.
Return type:
initializer_range
initializer_range: float = 0.02
intermediate_size
intermediate_size: int = 11008
kv_lora_rank
kv_lora_rank: int = 512
kv_params
kv_params: KVCacheParams
max_batch_context_length
max_batch_context_length: int = 131072
max_position_embeddings
max_position_embeddings: int = 2048
moe_intermediate_size
moe_intermediate_size: int = 1407
moe_layer_freq
moe_layer_freq: int = 1
n_group
n_group: int = 0
n_routed_experts
n_routed_experts: int = 0
n_shared_experts
n_shared_experts: int = 0
norm_topk_prob
norm_topk_prob: bool = False
num_attention_heads
num_attention_heads: int = 32
num_experts_per_tok
num_experts_per_tok: int = 0
num_hidden_layers
num_hidden_layers: int = 30
num_key_value_heads
num_key_value_heads: int = 32
pad_token_id
pad_token_id: int | None = None
pretraining_tp
pretraining_tp: int = 1
q_lora_rank
q_lora_rank: int | None = 1536
qk_nope_head_dim
qk_nope_head_dim: int = 128
qk_rope_head_dim
qk_rope_head_dim: int = 64
rms_norm_eps
rms_norm_eps: float = 1e-06
rope_scaling
rope_scaling: dict[str, Any] | None = None
rope_theta
rope_theta: float = 10000.0
routed_scaling_factor
routed_scaling_factor: float = 1.0
scoring_func
scoring_func: str = 'softmax'
seq_aux
seq_aux: bool = True
tie_word_embeddings
tie_word_embeddings: bool = False
topk_group
topk_group: int = 0
topk_method
topk_method: str = 'greedy'
use_cache
use_cache: bool = True
v_head_dim
v_head_dim: int = 128
vocab_size
vocab_size: int = 102400
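Under the multi-head latent attention scheme DeepSeek-V2 uses, the query/key head dimension is split into a rotary (RoPE) slice and a non-rotary slice. A short sketch of how the default dimensions above compose, assuming that standard convention:

```python
# How the MLA head dimensions compose under the usual DeepSeek-V2
# convention (an assumption about this config, not an API call).
qk_rope_head_dim = 64    # rotary (position-dependent) slice per head
qk_nope_head_dim = 128   # non-rotary slice per head
v_head_dim = 128
num_attention_heads = 32

q_head_dim = qk_rope_head_dim + qk_nope_head_dim  # full query/key head dim
print(q_head_dim)                        # 192
print(num_attention_heads * v_head_dim)  # 4096, matching the hidden_size default
```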
DeepseekV2Inputs
class max.pipelines.architectures.deepseekV2.DeepseekV2Inputs(tokens, input_row_offsets, signal_buffers, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits)
Bases: ModelInputs
A class representing inputs for the DeepseekV2 model.
This class encapsulates the input tensors required for the DeepseekV2 model execution:
- tokens: A tensor containing the input token IDs
- input_row_offsets: A tensor containing the offsets for each row in the ragged input sequence
- return_n_logits: A tensor containing the number of logits to return
Parameters:
input_row_offsets
input_row_offsets: Buffer
return_n_logits
return_n_logits: Buffer
signal_buffers
Device buffers used for synchronization in communication collectives.
tokens
tokens: Buffer
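The ragged layout implied by tokens and input_row_offsets can be illustrated with a small sketch (plain Python lists here, standing in for the actual device buffers): sequences are concatenated into one flat token array, and input_row_offsets marks where each row begins, with a final entry equal to the total token count.

```python
# Hypothetical batch of three token sequences of different lengths.
sequences = [[5, 17, 42], [7], [99, 3]]

# Flatten into one ragged tokens array.
tokens = [t for seq in sequences for t in seq]

# Offsets: row i spans tokens[input_row_offsets[i]:input_row_offsets[i + 1]];
# the final entry is the total token count.
input_row_offsets = [0]
for seq in sequences:
    input_row_offsets.append(input_row_offsets[-1] + len(seq))

print(tokens)             # [5, 17, 42, 7, 99, 3]
print(input_row_offsets)  # [0, 3, 4, 6]
```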
DeepseekV2Model
class max.pipelines.architectures.deepseekV2.DeepseekV2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE)
Bases: LogProbabilitiesMixin, PipelineModelWithKVCache[TextContext]
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
calculate_max_seq_len()
classmethod calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the optimal max sequence length for the model.
Models are expected to implement this method. The following example shows how to implement it for a Mistral model:
class MistralModel(PipelineModel):
    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        try:
            return upper_bounded_default(
                upper_bound=huggingface_config.max_seq_len,
                default=pipeline_config.model.max_length,
            )
        except ValueError as e:
            raise ValueError(
                "Unable to infer max_length for Mistral, the provided "
                f"max_length ({pipeline_config.model.max_length}) exceeds the "
                f"model's max_seq_len ({huggingface_config.max_seq_len})."
            ) from e
Parameters:
- pipeline_config (PipelineConfig) – Configuration for the pipeline.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
Returns:
The maximum sequence length to use.
Return type:
execute()
execute(model_inputs)
Executes the graph with the given inputs.
Parameters:
- model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.
Returns:
ModelOutputs containing the pipeline’s output tensors.
Return type:
This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Returns the KV cache params for the pipeline model.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
graph_inputs()
graph_inputs()
Return type:
tuple[TensorType | BufferType, …]
load_model()
load_model(session)
Parameters:
- session (InferenceSession)
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs to be passed to execute().
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and kv_cache_inputs (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
Parameters:
- replica_batches (Sequence[Sequence[TextContext]])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- return_n_logits (int)
Return type:
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the secondary inputs to be passed to execute().
While prepare_initial_token_inputs is responsible for managing the initial inputs, this function is responsible for updating the inputs for each step in a multi-step execution pattern.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
Return type:
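Taken together, the two prepare methods support a multi-step decode loop roughly like the following sketch. Only the prepare_initial_token_inputs / prepare_next_token_inputs / execute method names come from the documented API; model, batches, kv_inputs, and sample are caller-supplied placeholders.

```python
# Hedged sketch of the multi-step execution pattern; `model`, `batches`,
# `kv_inputs`, and `sample` are placeholders, not part of this API.
def generate_steps(model, batches, kv_inputs, sample, num_steps: int):
    # Build the initial batched inputs (claiming KV cache slots as needed).
    inputs = model.prepare_initial_token_inputs(
        batches, kv_cache_inputs=kv_inputs, return_n_logits=1
    )
    tokens_per_step = []
    for _ in range(num_steps):
        outputs = model.execute(inputs)
        next_tokens = sample(outputs)  # caller-supplied sampling step
        tokens_per_step.append(next_tokens)
        # Reuse the previous inputs, updated with the freshly sampled tokens.
        inputs = model.prepare_next_token_inputs(next_tokens, inputs)
    return tokens_per_step
```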