
Python module

max.pipelines.architectures.hy_v3

Tencent Hunyuan Hy3-preview (HYV3ForCausalLM).

HYV3Config​

class max.pipelines.architectures.hy_v3.HYV3Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, target_layer_ids=None, use_subgraphs=True, data_parallel_degree=1, num_local_experts=192, num_experts_per_tok=8, moe_intermediate_size=1536, num_shared_experts=1, router_scaling_factor=2.826, route_norm=True, first_k_dense_replace=1, intermediate_size_dense=13312, correction_bias_dtype=None, gate_dtype=None, ep_config=None)

source

Bases: Llama3Config

Hy3-preview decoder-only MoE config.

Parameters:

calculate_attention_multiplier()​

static calculate_attention_multiplier(huggingface_config)

source

Returns 1 / sqrt(head_dim), the standard scaled dot-product attention scale.

Parameters:

huggingface_config (AutoConfig)

Return type:

float
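As an illustration of the formula above, here is a minimal standalone sketch of the same scale. It does not call the MAX API: `attention_multiplier` is a hypothetical helper, and the `head_dim` values are assumed examples, not values read from the Hy3 Hugging Face config.

```python
import math


def attention_multiplier(head_dim: int) -> float:
    """Return the scaled dot-product attention scale, 1 / sqrt(head_dim)."""
    if head_dim <= 0:
        raise ValueError("head_dim must be positive")
    return 1.0 / math.sqrt(head_dim)


# head_dim values below are assumed for illustration only.
for head_dim in (64, 128):
    print(head_dim, attention_multiplier(head_dim))
```

Attention logits are multiplied by this value before the softmax, which keeps their variance roughly independent of the head dimension.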

construct_kv_params()​

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

source

Construct KV cache params using the explicit head_dim.

With data_parallel_degree=1 and n_devices>1, KVCacheParams.__post_init__ tensor-parallel-shards the paged KV cache so that each device holds n_kv_heads_per_device = num_key_value_heads // n_devices KV heads (e.g. 8 // 4 = 2). Hy3 attention is TP-sharded to match: each device's shard owns num_attention_heads // n_devices Q heads and num_key_value_heads // n_devices KV heads. These are exactly the KV heads resident in that device's slice of the cache, so flash_attention_ragged reads device-local KV heads consistently (see layers/attention.py).

Constraint: num_key_value_heads (8) must be divisible by n_devices so that the per-device KV-head count is a whole number. The valid attention TP degrees for Hy3 are therefore 1, 2, 4, and 8.

Parameters:

Return type:

KVCacheParams
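The sharding arithmetic described above can be sketched in plain Python. This is not the KVCacheParams implementation: `heads_per_device` and `valid_tp_degrees` are hypothetical helpers, and `num_attention_heads=64` is an assumed value chosen for illustration (only `num_key_value_heads=8` appears in the text above).

```python
def heads_per_device(
    num_attention_heads: int, num_key_value_heads: int, n_devices: int
) -> tuple[int, int]:
    """Return (Q heads, KV heads) owned by each device under tensor parallelism."""
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    if num_key_value_heads % n_devices != 0:
        raise ValueError(
            f"num_key_value_heads ({num_key_value_heads}) is not divisible "
            f"by n_devices ({n_devices})"
        )
    if num_attention_heads % n_devices != 0:
        raise ValueError(
            f"num_attention_heads ({num_attention_heads}) is not divisible "
            f"by n_devices ({n_devices})"
        )
    return num_attention_heads // n_devices, num_key_value_heads // n_devices


def valid_tp_degrees(num_key_value_heads: int) -> list[int]:
    """List every device count that divides the KV-head count evenly."""
    return [
        d for d in range(1, num_key_value_heads + 1)
        if num_key_value_heads % d == 0
    ]


# num_attention_heads=64 is an assumption; 8 KV heads comes from the text.
print(heads_per_device(64, 8, 4))
print(valid_tp_degrees(8))
```

With 8 KV heads, a device count such as 3 or 6 fails the divisibility check, which matches the constraint stated above.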

correction_bias_dtype​

correction_bias_dtype: DType | None = None

source

ep_config​

ep_config: EPConfig | None = None

source

first_k_dense_replace​

first_k_dense_replace: int = 1

source

gate_dtype​

gate_dtype: DType | None = None

source

get_num_layers()​

static get_num_layers(huggingface_config)

source

Layer count for the decoder stack (override when HF uses a different field).

Parameters:

huggingface_config (AutoConfig)

Return type:

int

initialize()​

classmethod initialize(pipeline_config, model_config=None)

source

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self
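The model_config fallback described above can be sketched with stand-in types. `ToyModelConfig`, `ToyPipelineConfig`, and `resolve_model_config` are hypothetical names used only to show the selection rule; they are not the MAX classes, and the real initialize() does considerably more than pick a config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyModelConfig:
    """Hypothetical stand-in for MAXModelConfig."""
    model_path: str


@dataclass
class ToyPipelineConfig:
    """Hypothetical stand-in for PipelineConfig."""
    model: ToyModelConfig
    draft_model: Optional[ToyModelConfig] = None


def resolve_model_config(
    pipeline_config: ToyPipelineConfig,
    model_config: Optional[ToyModelConfig] = None,
) -> ToyModelConfig:
    """Use the explicit config when given, otherwise pipeline_config.model."""
    return pipeline_config.model if model_config is None else model_config


pipeline = ToyPipelineConfig(
    model=ToyModelConfig("main-model"),
    draft_model=ToyModelConfig("draft-model"),
)
print(resolve_model_config(pipeline).model_path)
print(resolve_model_config(pipeline, pipeline.draft_model).model_path)
```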

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

source

Parameters:

Return type:

Self

intermediate_size_dense​

intermediate_size_dense: int = 13312

source

moe_intermediate_size​

moe_intermediate_size: int = 1536

source

num_experts_per_tok​

num_experts_per_tok: int = 8

source

num_local_experts​

num_local_experts: int = 192

source

num_shared_experts​

num_shared_experts: int = 1

source

route_norm​

route_norm: bool = True

source

router_scaling_factor​

router_scaling_factor: float = 2.826

source
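This page does not describe how the routing fields are combined, so the following is only one plausible reading, modelled on common top-k MoE routers: select num_experts_per_tok experts by score, renormalize the selected weights when route_norm is true, then multiply by router_scaling_factor. The function `route_token` is hypothetical, the scores are assumed to be positive (for example sigmoid or softmax outputs), and the actual Hy3 router may differ, for instance in how a correction bias enters the selection.

```python
def route_token(
    scores: list[float],
    top_k: int = 8,
    route_norm: bool = True,
    scaling_factor: float = 2.826,
) -> tuple[list[int], list[float]]:
    """Pick top_k experts for one token and return (expert_ids, weights)."""
    if not 0 < top_k <= len(scores):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Highest score first; ties go to the lower expert index.
    expert_ids = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:top_k]
    weights = [scores[i] for i in expert_ids]
    if route_norm:
        total = sum(weights)
        if total <= 0:
            raise ValueError("selected scores must sum to a positive value")
        # Renormalize so the selected weights sum to 1 before scaling.
        weights = [w / total for w in weights]
    return expert_ids, [w * scaling_factor for w in weights]


# Four toy experts and top_k=2 keep the example small; Hy3 defaults are 192 and 8.
ids, weights = route_token([0.1, 0.4, 0.2, 0.3], top_k=2)
print(ids, weights)
```

Under this reading, with route_norm enabled the routed weights for each token sum to router_scaling_factor regardless of the raw scores.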

HYV3Inputs​

class max.pipelines.architectures.hy_v3.HYV3Inputs(tokens, input_row_offsets, signal_buffers, return_n_logits, lora_grouped_offsets=None, num_active_loras=None, lora_end_idx=None, batch_seq_len=None, lora_ids_kv=None, lora_grouped_offsets_kv=None, data_parallel_splits=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, ep_inputs=(), host_input_row_offsets=None)

source

Bases: Llama3Inputs

Model inputs with expert-parallel (EP) and data-parallel (DP) support.

Parameters:

buffers​

property buffers: tuple[Buffer, ...]

source

Returns positional Buffer inputs for model ABI calls.

ep_inputs​

ep_inputs: tuple[Buffer, ...] = ()

source

host_input_row_offsets​

host_input_row_offsets: Buffer | None = None

source

HYV3Model​

class max.pipelines.architectures.hy_v3.HYV3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

source

Bases: AlwaysSignalBuffersMixin, LlamaModelBase

Hy3-preview pipeline model.

Parameters:

attention_bias​

attention_bias: bool = False

source

Whether to use attention bias.

estimate_activation_memory()​

classmethod estimate_activation_memory(pipeline_config, huggingface_config)

source

Estimates the activation memory required for model execution.

This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.

The default implementation returns 0 for backward compatibility. Models with significant activation memory requirements should override this method to provide accurate estimates.

Parameters:

  • pipeline_config (PipelineConfig) – Pipeline configuration
  • huggingface_config (AutoConfig) – Hugging Face model configuration

Returns:

Estimated activation memory in bytes

Return type:

int
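The override pattern described above might look like the sketch below. The classes are hypothetical stand-ins, the configs are plain dicts, and the key `max_batch_tokens` and the sizing formula are purely illustrative assumptions; this is not how HYV3Model computes its estimate.

```python
class ToyBaseModel:
    """Hypothetical stand-in for a pipeline model base class."""

    @classmethod
    def estimate_activation_memory(cls, pipeline_config, huggingface_config) -> int:
        # Default: report no extra activation memory.
        return 0


class ToyLargeModel(ToyBaseModel):
    """Hypothetical model that reserves room for one hidden-state buffer."""

    BYTES_PER_ELEMENT = 2  # assumes 16-bit activations

    @classmethod
    def estimate_activation_memory(cls, pipeline_config, huggingface_config) -> int:
        # Illustrative formula only: tokens in flight x hidden size x bytes.
        max_tokens = pipeline_config["max_batch_tokens"]  # hypothetical key
        hidden_size = huggingface_config["hidden_size"]
        return max_tokens * hidden_size * cls.BYTES_PER_ELEMENT


print(ToyBaseModel.estimate_activation_memory({}, {}))
print(
    ToyLargeModel.estimate_activation_memory(
        {"max_batch_tokens": 8192}, {"hidden_size": 4096}
    )
)
```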

load_model()​

load_model(session)

source

Parameters:

session (InferenceSession)

Return type:

Model

model​

model: Model

source

Compiled and initialized model ready for inference.

model_config_cls​

model_config_cls

source

alias of HYV3Config

norm_method​

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

source

Normalization method to use: 'rms_norm' or 'layer_norm'.

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

source

Prepare the inputs for the first pass in multistep execution.

Parameters:

Return type:

HYV3Inputs

prepare_next_token_inputs()​

prepare_next_token_inputs(next_tokens, prev_model_inputs)

source

Prepare the inputs for the next token in multistep execution. This should avoid any device synchronization or copy operations.

Parameters:

Return type:

HYV3Inputs
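The two-phase flow of prepare_initial_token_inputs() and prepare_next_token_inputs() can be pictured with a toy ragged batch. Everything here is a hypothetical stand-in: `ToyInputs` keeps only the `tokens` and `input_row_offsets` fields named in the HYV3Inputs signature, uses Python lists instead of device buffers, and ignores KV cache, LoRA, EP, and DP inputs.

```python
from dataclasses import dataclass


@dataclass
class ToyInputs:
    """Hypothetical stand-in for HYV3Inputs with host-side lists."""
    tokens: list[int]
    input_row_offsets: list[int]


def prepare_initial(prompts: list[list[int]]) -> ToyInputs:
    """First pass: flatten all prompts into one ragged token list."""
    tokens: list[int] = []
    offsets = [0]
    for prompt in prompts:
        tokens.extend(prompt)
        offsets.append(len(tokens))  # end of this sequence in the flat list
    return ToyInputs(tokens, offsets)


def prepare_next(next_tokens: list[int], prev: ToyInputs) -> ToyInputs:
    """Later passes: exactly one new token per sequence in the batch."""
    batch_size = len(prev.input_row_offsets) - 1
    if len(next_tokens) != batch_size:
        raise ValueError("expected one next token per sequence")
    # With one token per sequence the row offsets are simply 0, 1, ..., batch_size.
    return ToyInputs(list(next_tokens), list(range(batch_size + 1)))


first = prepare_initial([[1, 2, 3], [4, 5]])
second = prepare_next([6, 7], first)
print(first)
print(second)
```

In this sketch the next-step inputs depend only on the batch size, which hints at why a real implementation can build them without synchronizing with the device.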

state_dict​

state_dict: dict[str, Any]

source

Weights to load into the model.