Python module
max.pipelines.architectures.kimik2_5
Kimi K2.5 mixture-of-experts architecture for text generation.
KimiK2_5Config
class max.pipelines.architectures.kimik2_5.KimiK2_5Config(*, devices, dtype, bos_token_id, eos_token_id, ignore_index, media_placeholder_token_id, pad_token_id, tie_word_embeddings, use_unified_vision_chunk, video_placeholder, vision_config, llm_config)
Bases: ArchConfigWithKVAndVisionCache
Configuration for Kimi-K2.5 models.
Parameters:
- devices (list[DeviceRef])
- dtype (DType)
- bos_token_id (int)
- eos_token_id (int)
- ignore_index (int)
- media_placeholder_token_id (int)
- pad_token_id (int)
- tie_word_embeddings (bool)
- use_unified_vision_chunk (bool | None)
- video_placeholder (str | None)
- vision_config (VisionConfig)
- llm_config (KimiK2_5TextConfig)
bos_token_id
bos_token_id: int
ID of the beginning-of-sequence (BOS) token.
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
Return type:
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
devices
Devices that the Kimi-K2.5 model is parallelized over.
dtype
dtype: DType
DType of the Kimi-K2.5 model weights.
eos_token_id
eos_token_id: int
ID of the end-of-sequence (EOS) token.
estimate_vision_cache_entry_bytes()
static estimate_vision_cache_entry_bytes(huggingface_config)
Estimate per-entry bytes for the vision encoder cache.
Max tokens per image = pos_emb_height * pos_emb_width / merge_sq, multiplied by the text hidden size and 2 bytes (bfloat16).
Parameters:
- huggingface_config (AutoConfig)
Return type:
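As a worked example of the formula above (the concrete values are illustrative, not taken from any released checkpoint):

pos_emb_height = 64
pos_emb_width = 64
merge_sq = 4          # merge kernel area, e.g. a 2x2 spatial merge
text_hidden_size = 7168
bytes_per_elem = 2    # bfloat16

max_tokens_per_image = pos_emb_height * pos_emb_width // merge_sq  # 1024
entry_bytes = max_tokens_per_image * text_hidden_size * bytes_per_elem
# 1024 * 7168 * 2 = 14,680,064 bytes (~14 MiB) per cache entry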
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
Return type:
get_num_layers()
static get_num_layers(huggingface_config)
Parameters:
- huggingface_config (AutoConfig)
Return type:
ignore_index
ignore_index: int
Index that should be ignored when calculating loss (e.g., for padding).
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a KimiK2_5Config instance from pipeline configuration.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
A KimiK2_5Config instance with fields initialized from config.
Return type:
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config, llm_config=None)
Initializes a KimiK2_5Config from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- llm_config (KimiK2_5TextConfig | None) – Pre-initialized KimiK2_5TextConfig instance.
Returns:
A KimiK2_5Config instance ready for finalization.
Return type:
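A minimal sketch of the two-phase pattern described above; the finalize() call shown is an assumption based on this description (only initialize_from_config() is documented here), and pipeline_config / huggingface_config are assumed to come from the surrounding pipeline setup:

config = KimiK2_5Config.initialize_from_config(
    pipeline_config=pipeline_config,
    huggingface_config=huggingface_config,
)
# ... load the state_dict, then set state_dict-dependent fields:
# config.finalize(...)  # hypothetical call; exact signature not shown here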
llm_config
llm_config: KimiK2_5TextConfig
Language model configuration using DeepseekV3 architecture.
media_placeholder_token_id
media_placeholder_token_id: int
Token ID used as a placeholder for media (e.g., images, video frames) within sequences.
pad_token_id
pad_token_id: int
Token ID used for padding sequences to uniform length.
tie_word_embeddings
tie_word_embeddings: bool
Whether to share (tie) the input and output word embeddings in the language model.
use_unified_vision_chunk
Whether to use a unified chunk for vision inputs.
video_placeholder
Placeholder string used to represent video segments in input text.
vision_config
vision_config: VisionConfig
Vision encoder configuration.
KimiK2_5Model
class max.pipelines.architectures.kimik2_5.KimiK2_5Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[KimiK2_5TextAndVisionContext]
A Kimi-K2.5 pipeline model for multimodal text generation.
Parameters:
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
calculate_max_seq_len()
classmethod calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the optimal max sequence length for the model.
Models are expected to implement this method. The following example shows how to implement it for a Mistral model:
class MistralModel(PipelineModel):
    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        try:
            return upper_bounded_default(
                upper_bound=huggingface_config.max_seq_len,
                default=pipeline_config.model.max_length,
            )
        except ValueError as e:
            raise ValueError(
                "Unable to infer max_length for Mistral, the provided "
                f"max_length ({pipeline_config.model.max_length}) exceeds the "
                f"model's max_seq_len ({huggingface_config.max_seq_len})."
            ) from e
Parameters:
- pipeline_config (PipelineConfig) – Configuration for the pipeline.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
Returns:
The maximum sequence length to use.
Return type:
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
Parameters:
- pipeline_config (PipelineConfig) – Pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
Returns:
Estimated activation memory in bytes.
Return type:
estimate_weights_size()
classmethod estimate_weights_size(pipeline_config)
Calculates the estimated memory consumption of the model's weights.
Parameters:
- pipeline_config (PipelineConfig)
Return type:
execute()
execute(model_inputs)
Executes the graph with the given inputs.
Parameters:
- model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.
Returns:
ModelOutputs containing the pipeline's output tensors.
Return type:
This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Returns the KV cache params for the pipeline model.
Parameters:
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
Return type:
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Load the model with the given weights.
Parameters:
- session (InferenceSession)
Return type:
model
property model: Model
Exposes the language model for graph capture/replay.
Only the language model is captured, since vision runs during prefill.
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs to be passed to execute().
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and kv_cache_inputs (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
Parameters:
- replica_batches
- kv_cache_inputs
- return_n_logits
Return type:
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the secondary inputs to be passed to execute().
While prepare_initial_token_inputs manages the initial inputs, this function updates the inputs for each step in a multi-step execution pattern.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
Return type:
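Taken together, the two preparation hooks and execute() form the usual multi-step loop. A sketch of that pattern (serving-side names such as replica_batches, kv_cache_inputs, num_steps, and the sample() helper are assumptions, not part of this API):

inputs = model.prepare_initial_token_inputs(
    replica_batches, kv_cache_inputs=kv_cache_inputs, return_n_logits=1
)
for _ in range(num_steps):
    outputs = model.execute(inputs)
    next_tokens = sample(outputs)  # hypothetical sampling helper
    inputs = model.prepare_next_token_inputs(next_tokens, inputs)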
release()
release(request_id)
Release vision encoder cache entries for a completed request.
Parameters:
- request_id (RequestID)
Return type:
None
vision_model
vision_model: Model
The compiled vision model for processing images.
KimiK2_5ModelInputs
class max.pipelines.architectures.kimik2_5.KimiK2_5ModelInputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, image_token_indices=None, precomputed_image_embeddings=None, pixel_values=None, grid_thws=None, cu_seqlens=None, max_seqlen=None, vision_position_ids=None, language_image_embeddings=<factory>, language_image_token_indices=<factory>, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())
Bases: DeepseekV3Inputs
A class representing inputs for the KimiK2_5 model.
This class encapsulates the input tensors required for KimiK2_5 model execution, including both text and vision inputs. Vision inputs are optional and can be None for text-only processing.
Parameters:
- tokens (Buffer)
- input_row_offsets (Buffer)
- signal_buffers (list[Buffer])
- host_input_row_offsets (Buffer)
- batch_context_lengths (list[Buffer])
- image_token_indices (list[Buffer] | None)
- precomputed_image_embeddings (list[Buffer] | None)
- pixel_values (list[Buffer] | None)
- grid_thws (list[Buffer] | None)
- cu_seqlens (list[Buffer] | None)
- max_seqlen (list[Buffer] | None)
- vision_position_ids (list[Buffer] | None)
- language_image_embeddings (list[Buffer])
- language_image_token_indices (list[Buffer])
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
- return_n_logits (Buffer)
- data_parallel_splits (Buffer)
- ep_inputs (tuple[Buffer, ...])
buffers
Returns the language model input ABI tuple.
cu_seqlens
Cumulative sequence lengths for full attention per device.
grid_thws
Grid dimensions (temporal, height, width) for each image/video, shape (n_images, 3) per device.
has_vision_inputs
property has_vision_inputs: bool
Check if this input contains vision data.
image_token_indices
Per-device pre-computed multimodal merge indices for the image embeddings.
These are the locations of the image_token_id in the inputs fed to the model.
Some indices may be negative, which means that they are ignored by the multimodal merge.
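As an illustration of that convention (a toy sketch; the real merge runs inside the compiled graph):

hidden = [[0.0] * 4 for _ in range(5)]        # toy hidden states, 5 positions
image_embeddings = [[1.0] * 4, [2.0] * 4]     # two toy image embeddings
image_token_indices = [1, -1]                 # the negative index is skipped

for emb, idx in zip(image_embeddings, image_token_indices):
    if idx >= 0:                              # ignore negative indices
        hidden[idx] = emb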
language_image_embeddings
Per-device image embeddings for the language model graph. Shape [0, hidden_size] during decode, [num_patches, hidden_size] during prefill.
language_image_token_indices
Per-device scatter indices for the language model graph. Shape [0] during decode, [num_image_tokens] during prefill.
max_seqlen
Maximum sequence length for full attention for vision inputs per device.
pixel_values
Pixel values for vision inputs.
precomputed_image_embeddings
Pre-computed image embeddings from VisionEncoderCache.
vision_position_ids
Vision rotary position IDs per device.
KimiK2_5ReasoningParser
class max.pipelines.architectures.kimik2_5.KimiK2_5ReasoningParser(think_start_token_id, think_end_token_id, tool_section_start_token_id=None)
Bases: ReasoningParser
Kimi K2.5 reasoning parser for reasoning spans delimited by dedicated think start/end tokens.
Reasoning may end implicitly when a tool call section begins (<|tool_calls_section_begin|>).
Reasoning may begin implicitly, without an explicit think-start token.
Reasoning can be disabled through the chat template by including a token in the prompt.
Parameters:
- think_start_token_id (int)
- think_end_token_id (int)
- tool_section_start_token_id (int | None)
from_tokenizer()
async classmethod from_tokenizer(tokenizer)
Construct a reasoning parser from a tokenizer.
Parameters:
- tokenizer (PipelineTokenizer[Any, Any, Any])
Return type:
stream()
stream(delta_token_ids)
Identify a reasoning span within a streaming delta chunk.
Parameters:
- delta_token_ids
Return type:
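A usage sketch (the token IDs are hypothetical placeholders, token_stream is assumed to be provided by the serving loop, and the exact return type of stream() is not documented here):

parser = KimiK2_5ReasoningParser(
    think_start_token_id=100,  # hypothetical ID
    think_end_token_id=101,    # hypothetical ID
)
for delta_token_ids in token_stream:
    span = parser.stream(delta_token_ids)  # reasoning span for this chunk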
KimiK2_5TextConfig
class max.pipelines.architectures.kimik2_5.KimiK2_5TextConfig(*, dtype: 'DType', kv_params: 'KVCacheParamInterface', devices: 'list[DeviceRef]', use_subgraphs: 'bool' = True, data_parallel_degree: 'int' = 1, vocab_size: 'int' = 129280, hidden_size: 'int' = 7168, intermediate_size: 'int' = 18432, moe_intermediate_size: 'int' = 2048, moe_layer_freq: 'int' = 1, num_hidden_layers: 'int' = 61, num_attention_heads: 'int' = 128, num_key_value_heads: 'int' = 128, n_shared_experts: 'int' = 1, n_routed_experts: 'int' = 256, routed_scaling_factor: 'float' = 2.5, kv_lora_rank: 'int' = 512, q_lora_rank: 'int' = 1536, qk_rope_head_dim: 'int' = 64, v_head_dim: 'int' = 128, qk_nope_head_dim: 'int' = 128, topk_method: 'str' = 'greedy', n_group: 'int' = 8, topk_group: 'int' = 4, num_experts_per_tok: 'int' = 8, first_k_dense_replace: 'int' = 3, norm_topk_prob: 'bool' = True, hidden_act: 'str' = 'silu', max_position_embeddings: 'int' = 4096, max_seq_len: 'int' = 163840, rms_norm_eps: 'float' = 1e-06, tie_word_embeddings: 'bool' = False, rope_theta: 'float' = 10000.0, rope_scaling: 'dict[str, Any] | None' = None, rope_interleave: 'bool' = True, scoring_func: 'str' = 'sigmoid', attention_bias: 'bool' = False, attention_dropout: 'float' = 0.0, norm_dtype: 'DType' = bfloat16, gate_dtype: 'DType | None' = None, correction_bias_dtype: 'DType | None' = None, max_batch_context_length: 'int' = 131072, quant_config: 'QuantConfig | None' = None, ep_config: 'EPConfig | None' = None, graph_mode: 'str' = 'auto', return_logits: 'ReturnLogits' = <ReturnLogits.LAST_TOKEN: 'last_token'>, return_hidden_states: 'ReturnHiddenStates' = <ReturnHiddenStates.NONE: 'none'>, eagle_aux_hidden_state_layer_ids: 'list[int] | None' = None)
Bases: DeepseekV3Config
Parameters:
- dtype (DType)
- kv_params (KVCacheParamInterface)
- devices (list[DeviceRef])
- use_subgraphs (bool)
- data_parallel_degree (int)
- vocab_size (int)
- hidden_size (int)
- intermediate_size (int)
- moe_intermediate_size (int)
- moe_layer_freq (int)
- num_hidden_layers (int)
- num_attention_heads (int)
- num_key_value_heads (int)
- n_shared_experts (int)
- n_routed_experts (int)
- routed_scaling_factor (float)
- kv_lora_rank (int)
- q_lora_rank (int)
- qk_rope_head_dim (int)
- v_head_dim (int)
- qk_nope_head_dim (int)
- topk_method (str)
- n_group (int)
- topk_group (int)
- num_experts_per_tok (int)
- first_k_dense_replace (int)
- norm_topk_prob (bool)
- hidden_act (str)
- max_position_embeddings (int)
- max_seq_len (int)
- rms_norm_eps (float)
- tie_word_embeddings (bool)
- rope_theta (float)
- rope_scaling (dict[str, Any] | None)
- rope_interleave (bool)
- scoring_func (str)
- attention_bias (bool)
- attention_dropout (float)
- norm_dtype (DType)
- gate_dtype (DType | None)
- correction_bias_dtype (DType | None)
- max_batch_context_length (int)
- quant_config (QuantConfig | None)
- ep_config (EPConfig | None)
- graph_mode (str)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
- eagle_aux_hidden_state_layer_ids (list[int] | None)
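With these defaults, the layer layout can be read off directly, assuming the usual DeepseekV3 convention that this config inherits (stated here as an assumption, not quoted from this implementation): the first first_k_dense_replace layers use a dense MLP, and every later layer whose index is a multiple of moe_layer_freq is a MoE layer routing each token to num_experts_per_tok of the n_routed_experts experts, plus n_shared_experts shared experts:

first_k_dense_replace = 3
moe_layer_freq = 1
num_hidden_layers = 61

moe_layers = [
    i
    for i in range(num_hidden_layers)
    if i >= first_k_dense_replace and i % moe_layer_freq == 0
]
# Layers 0-2 are dense; layers 3-60 (58 layers) each route a token to
# 8 of 256 routed experts plus 1 shared expert under the defaults above.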
calculate_max_seq_len()
classmethod calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the maximum sequence length for the Kimi K2.5 language model.
Parameters:
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
Return type:
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a KimiK2_5TextConfig instance from pipeline configuration.
This method creates a config instance with all fields that can be determined from the pipeline configuration, without needing the state_dict. Fields that depend on the state_dict (like norm_dtype, quant_config, etc.) should be set via the finalize() method.
Parameters:
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
Returns:
An initialized KimiK2_5TextConfig instance.
Return type:
VisionConfig
class max.pipelines.architectures.kimik2_5.VisionConfig(dtype, devices, init_pos_emb_height, init_pos_emb_time, init_pos_emb_width, merge_kernel_size, mm_hidden_size, patch_size, projector_ln_eps, text_hidden_size, vt_hidden_size, vt_intermediate_size, vt_num_attention_heads, vt_num_hidden_layers, merge_type=None, mm_projector_type=None, model_type='', pos_emb_type=None, projector_hidden_act=None, video_attn_type=None, has_bias=True, in_channels=3, rope_max_height=512, rope_max_width=512, rope_theta=10000.0)
Bases: object
Vision configuration for Kimi-K2.5 models with required fields.
Parameters:
- dtype (DType)
- devices (list[DeviceRef])
- init_pos_emb_height (int)
- init_pos_emb_time (int)
- init_pos_emb_width (int)
- merge_kernel_size (list[int])
- mm_hidden_size (int)
- patch_size (int)
- projector_ln_eps (float)
- text_hidden_size (int)
- vt_hidden_size (int)
- vt_intermediate_size (int)
- vt_num_attention_heads (int)
- vt_num_hidden_layers (int)
- merge_type (str | None)
- mm_projector_type (str | None)
- model_type (str)
- pos_emb_type (str | None)
- projector_hidden_act (str | None)
- video_attn_type (str | None)
- has_bias (bool)
- in_channels (int)
- rope_max_height (int)
- rope_max_width (int)
- rope_theta (float)
devices
Devices that the Kimi-K2.5 vision encoder model is parallelized over.
dtype
dtype: DType
DType of the Kimi-K2.5 vision model weights.
finalize()
finalize(vision_dtype)
Finalize VisionConfig with state_dict-dependent fields.
Parameters:
- vision_dtype (DType)
Return type:
None
has_bias
has_bias: bool = True
Whether linear projections in the vision transformer include bias terms.
in_channels
in_channels: int = 3
Number of input image channels (3 for RGB).
init_pos_emb_height
init_pos_emb_height: int
Height of the initial position embedding.
init_pos_emb_time
init_pos_emb_time: int
Temporal extent of the initial position embedding.
init_pos_emb_width
init_pos_emb_width: int
Width of the initial position embedding.
initialize_from_config()
classmethod initialize_from_config(pipeline_config, hf_vision_config, huggingface_config=None)
Initialize VisionConfig from HuggingFace vision config.
Parameters:
- pipeline_config (PipelineConfig) – MAX Engine pipeline configuration.
- hf_vision_config (AutoConfig) – HuggingFace vision sub-config.
- huggingface_config (AutoConfig | None) – Full HuggingFace model config, used to derive text_hidden_size from text_config.hidden_size when hf_vision_config does not carry the attribute directly (e.g. moonshotai/Kimi-VL-A3B-Instruct vs nvidia/Kimi-K2.5-NVFP4).
Return type:
Note: dtype fields will be set to defaults and should be updated via finalize() once the state_dict is available.
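A sketch of that fallback (attribute access follows the description above; treat the exact lookup as an assumption):

text_hidden_size = getattr(hf_vision_config, "text_hidden_size", None)
if text_hidden_size is None and huggingface_config is not None:
    # Fall back to the full model config's text sub-config.
    text_hidden_size = huggingface_config.text_config.hidden_size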
merge_kernel_size
Kernel size for the merge operation.
merge_type
Type of the merge operation.
mm_hidden_size
mm_hidden_size: int
Hidden size of the multi-modal hidden layer.
mm_projector_type
Type of the multi-modal projector.
model_type
model_type: str = ''
Type of the model.
patch_size
patch_size: int
Size of the patch.
pos_emb_type
Type of the position embedding.
projector_hidden_act
Activation function for the projector.
projector_ln_eps
projector_ln_eps: float
Epsilon for the projector's layer normalization.
rope_max_height
rope_max_height: int = 512
Maximum grid height for RoPE frequency precomputation. Hardcoded to 512 in https://huggingface.co/nvidia/Kimi-K2.5-NVFP4/blob/main/modeling_kimi_k25.py#L571
rope_max_width
rope_max_width: int = 512
Maximum grid width for RoPE frequency precomputation. Hardcoded to 512 in https://huggingface.co/nvidia/Kimi-K2.5-NVFP4/blob/main/modeling_kimi_k25.py#L571
rope_theta
rope_theta: float = 10000.0
Base for the RoPE inverse-frequency exponent. Hardcoded to 10000 in https://huggingface.co/nvidia/Kimi-K2.5-NVFP4/blob/main/modeling_kimi_k25.py#L379
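For reference, a generic RoPE inverse-frequency precompute using these constants looks like the sketch below (the standard formulation, assumed rather than excerpted from modeling_kimi_k25.py; head_dim is a hypothetical per-axis rotary dimension):

rope_theta = 10000.0
rope_max_height = 512
head_dim = 64  # hypothetical per-axis rotary dimension

inv_freq = [
    1.0 / (rope_theta ** (2 * i / head_dim)) for i in range(head_dim // 2)
]
# One row of angles per grid position up to rope_max_height;
# the width axis uses rope_max_width analogously.
angles = [[pos * f for f in inv_freq] for pos in range(rope_max_height)]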
text_hidden_size
text_hidden_size: int
Hidden size of the text (language) model.
video_attn_type
Type of the video attention.
vt_hidden_size
vt_hidden_size: int
Hidden size of the vision transformer.
vt_intermediate_size
vt_intermediate_size: int
Intermediate (MLP) size of the vision transformer.
vt_num_attention_heads
vt_num_attention_heads: int
Number of attention heads in the vision transformer.
vt_num_hidden_layers
vt_num_hidden_layers: int
Number of hidden layers in the vision transformer.