Python module
max.pipelines.architectures.kimik2_5
Kimi K2.5 mixture-of-experts architecture for text generation.
KimiK2_5Config
class max.pipelines.architectures.kimik2_5.KimiK2_5Config(*, devices, dtype, bos_token_id, eos_token_id, ignore_index, media_placeholder_token_id, pad_token_id, tie_word_embeddings, use_unified_vision_chunk, video_placeholder, vision_config, llm_config)
Bases: ArchConfigWithKVAndVisionCache
Configuration for Kimi-K2.5 models.
-
Parameters:
-
- devices (list[DeviceRef])
- dtype (DType)
- bos_token_id (int)
- eos_token_id (int)
- ignore_index (int)
- media_placeholder_token_id (int)
- pad_token_id (int)
- tie_word_embeddings (bool)
- use_unified_vision_chunk (bool | None)
- video_placeholder (str | None)
- vision_config (VisionConfig)
- llm_config (KimiK2_5TextConfig)
bos_token_id
bos_token_id: int
ID of the beginning-of-sequence (BOS) token.
calculate_max_seq_len()
static calculate_max_seq_len(pipeline_config, huggingface_config)
-
Parameters:
-
- pipeline_config (PipelineConfig)
- huggingface_config (AutoConfig)
-
Return type:
construct_kv_params()
static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
-
Parameters:
-
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
devices
devices: list[DeviceRef]
Devices that the Kimi-K2.5 model is parallelized over.
dtype
dtype: DType
DType of the Kimi-K2.5 model weights.
eos_token_id
eos_token_id: int
ID of the end-of-sequence (EOS) token.
estimate_vision_cache_entry_bytes()
static estimate_vision_cache_entry_bytes(huggingface_config)
Estimate per-entry bytes for the vision encoder cache.
Max tokens per image = pos_emb_height * pos_emb_width / merge_sq, multiplied by the text hidden size and 2 bytes (bfloat16).
-
Parameters:
-
huggingface_config (AutoConfig)
-
Return type:
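The per-entry estimate described above is straightforward arithmetic. The sketch below recomputes it with illustrative stand-in values; the variable names mirror the formula, but the numbers are not real Kimi-K2.5 config values:

```python
# Sketch of the per-entry vision cache estimate described above.
# All config values below are illustrative stand-ins, not real Kimi-K2.5 values.
pos_emb_height = 64
pos_emb_width = 64
merge_sq = 4           # spatial merge factor squared
text_hidden_size = 4096
BYTES_PER_ELEMENT = 2  # bfloat16

# Max tokens per image = pos_emb_height * pos_emb_width / merge_sq
max_tokens_per_image = pos_emb_height * pos_emb_width // merge_sq
entry_bytes = max_tokens_per_image * text_hidden_size * BYTES_PER_ELEMENT
print(entry_bytes)  # 1024 tokens * 4096 hidden * 2 bytes = 8388608
```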
get_kv_params()
get_kv_params()
Returns the KV cache parameters from the embedded LLM config.
-
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the maximum sequence length from the embedded LLM config.
-
Return type:
get_num_layers()
static get_num_layers(huggingface_config)
-
Parameters:
-
huggingface_config (AutoConfig)
-
Return type:
ignore_index
ignore_index: int
Index that should be ignored when calculating loss (e.g., for padding).
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initializes a KimiK2_5Config instance from pipeline configuration.
-
Parameters:
-
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- model_config (MAXModelConfig | None)
-
Returns:
-
A KimiK2_5Config instance with fields initialized from config.
-
Return type:
initialize_from_config()
classmethod initialize_from_config(pipeline_config, huggingface_config, llm_config=None)
Initializes a KimiK2_5Config from pipeline and HuggingFace configs.
This method creates a config instance with all fields that can be determined from the pipeline and HuggingFace configurations, without needing the state_dict. Fields that depend on the state_dict should be set via the finalize() method.
-
Parameters:
-
- pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
- huggingface_config (AutoConfig) – HuggingFace model configuration.
- llm_config (KimiK2_5TextConfig | None) – Optional pre-initialized language model configuration (using the DeepseekV3 architecture).
-
Returns:
-
A KimiK2_5Config instance ready for finalization.
-
Return type:
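The two-phase pattern described above (state_dict-independent fields first, weight-dependent fields via finalize()) can be sketched with a toy config. All class, method, and field names below are illustrative stand-ins, not the real MAX API:

```python
# Toy sketch of the two-phase initialization pattern described above:
# fields derivable without weights are set in initialize_from_config(),
# and fields that depend on the state_dict are filled in by finalize().
class ToyConfig:
    def __init__(self, max_seq_len: int) -> None:
        self.max_seq_len = max_seq_len
        self.vocab_size: int | None = None  # unknown until weights load

    @classmethod
    def initialize_from_config(cls, pipeline_config: dict) -> "ToyConfig":
        # Everything here comes from configuration, not weights.
        return cls(max_seq_len=pipeline_config["max_length"])

    def finalize(self, state_dict: dict) -> None:
        # Infer state_dict-dependent fields from the loaded weights.
        self.vocab_size = len(state_dict["embed_tokens.weight"])

config = ToyConfig.initialize_from_config({"max_length": 4096})
config.finalize({"embed_tokens.weight": [[0.0] * 8 for _ in range(100)]})
print(config.max_seq_len, config.vocab_size)  # 4096 100
```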
llm_config
llm_config: KimiK2_5TextConfig
Language model configuration using DeepseekV3 architecture.
media_placeholder_token_id
media_placeholder_token_id: int
Token ID used as a placeholder for media (e.g., images, video frames) within sequences.
pad_token_id
pad_token_id: int
Token ID used for padding sequences to uniform length.
tie_word_embeddings
tie_word_embeddings: bool
Whether to share (tie) the input and output word embeddings in the language model.
use_unified_vision_chunk
use_unified_vision_chunk: bool | None
Whether to use a unified chunk for vision inputs.
video_placeholder
video_placeholder: str | None
Placeholder string used to represent video segments in input text.
vision_config
vision_config: VisionConfig
Vision encoder configuration.
KimiK2_5Model
class max.pipelines.architectures.kimik2_5.KimiK2_5Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE)
Bases: AlwaysSignalBuffersMixin, PipelineModelWithKVCache[KimiK2_5TextAndVisionContext]
A Kimi-K2.5 pipeline model for multimodal text generation.
-
Parameters:
-
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
calculate_max_seq_len()
classmethod calculate_max_seq_len(pipeline_config, huggingface_config)
Calculates the optimal max sequence length for the model.
Models are expected to implement this method. The following example shows how to implement it for a Mistral model:
class MistralModel(PipelineModel):
    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        try:
            return upper_bounded_default(
                upper_bound=huggingface_config.max_seq_len,
                default=pipeline_config.model.max_length,
            )
        except ValueError as e:
            raise ValueError(
                "Unable to infer max_length for Mistral, the provided "
                f"max_length ({pipeline_config.model.max_length}) exceeds the "
                f"model's max_seq_len ({huggingface_config.max_seq_len})."
            ) from e
Parameters:
-
- pipeline_config (PipelineConfig) – Configuration for the pipeline.
- huggingface_config (AutoConfig) – Hugging Face model configuration.
-
Returns:
-
The maximum sequence length to use.
-
Return type:
estimate_activation_memory()
classmethod estimate_activation_memory(pipeline_config, huggingface_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution, such as intermediate activations and working buffers.
-
Parameters:
-
- pipeline_config (PipelineConfig) – Pipeline configuration
- huggingface_config (AutoConfig) – HuggingFace model configuration
-
Returns:
-
Estimated activation memory in bytes
-
Return type:
estimate_weights_size()
classmethod estimate_weights_size(pipeline_config)
Estimates the memory consumed by the model's weights.
-
Parameters:
-
pipeline_config (PipelineConfig)
-
Return type:
execute()
execute(model_inputs)
Executes the graph with the given inputs.
-
Parameters:
-
model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.
-
Returns:
-
ModelOutputs containing the pipeline’s output tensors.
-
Return type:
This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Returns the KV cache params for the pipeline model.
-
Parameters:
-
- huggingface_config (AutoConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
language_model
language_model: Model
The compiled language model for text generation.
load_model()
load_model(session)
Load the model with the given weights.
-
Parameters:
-
session (InferenceSession)
-
Return type:
model
property model: Model
Exposes the language model for graph capture/replay.
Only the language model is captured, since the vision model runs during prefill.
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)
Prepares the initial inputs to be passed to execute().
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and kv_cache_inputs (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the secondary inputs to be passed to execute().
While prepare_initial_token_inputs manages the initial inputs, this method updates the inputs for each step in a multi-step execution pattern.
-
Parameters:
-
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
-
Return type:
-
KimiK2_5ModelInputs
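The two prepare_* methods fit a common multi-step decode loop: prepare the initial inputs once, then alternate execute() with prepare_next_token_inputs(). The sketch below uses toy stand-in classes to show the calling pattern only; they are not MAX's ModelInputs/ModelOutputs types:

```python
# Toy sketch of the multi-step calling pattern around
# prepare_initial_token_inputs, execute, and prepare_next_token_inputs.
# All classes here are illustrative stand-ins, not the real MAX API.
from dataclasses import dataclass

@dataclass
class ToyInputs:
    tokens: list[int]

@dataclass
class ToyOutputs:
    next_token: int

class ToyModel:
    def prepare_initial_token_inputs(self, prompt: list[int]) -> ToyInputs:
        return ToyInputs(tokens=list(prompt))

    def execute(self, inputs: ToyInputs) -> ToyOutputs:
        # Pretend "decoding": the next token is one more than the last.
        return ToyOutputs(next_token=inputs.tokens[-1] + 1)

    def prepare_next_token_inputs(
        self, next_token: int, prev: ToyInputs
    ) -> ToyInputs:
        # Update the previous inputs with the newly sampled token.
        return ToyInputs(tokens=prev.tokens + [next_token])

model = ToyModel()
inputs = model.prepare_initial_token_inputs([1, 2, 3])
generated = []
for _ in range(3):  # three decode steps
    outputs = model.execute(inputs)
    generated.append(outputs.next_token)
    inputs = model.prepare_next_token_inputs(outputs.next_token, inputs)
print(generated)  # [4, 5, 6]
```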
release()
release(request_id)
Release vision encoder cache entries for a completed request.
-
Parameters:
-
request_id (RequestID)
-
Return type:
-
None
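release() frees vision encoder cache entries keyed by request ID. A minimal dict-based sketch of that pattern follows; the VisionCache class is hypothetical and stands in for whatever structure the pipeline actually uses:

```python
# Minimal sketch of a per-request vision encoder cache with release(),
# mirroring the pattern described above. Purely illustrative, not MAX's
# actual cache implementation.
class VisionCache:
    def __init__(self) -> None:
        self._entries: dict[str, list[bytes]] = {}

    def put(self, request_id: str, entry: bytes) -> None:
        self._entries.setdefault(request_id, []).append(entry)

    def release(self, request_id: str) -> None:
        # Drop all cached vision encodings for a completed request.
        self._entries.pop(request_id, None)

cache = VisionCache()
cache.put("req-1", b"encoded-image")
cache.release("req-1")
cache.release("req-1")  # releasing an unknown or released ID is a no-op
```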
vision_model
vision_model: Model
The compiled vision model for processing images.