Python module
max.pipelines.architectures.unified_dflash_kimi_k25
DFlash speculative decoding for Kimi K2.5 with unified graph compilation.
UnifiedDflashKimiK25
class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25(config)
Bases: Module
Fused: merge -> target (MLA) -> reject -> materialize -> draft block.
Parameters:
config (UnifiedDflashKimiK25Config)
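The reject step of the fused pipeline can be illustrated with a simplified greedy verification rule. This is a sketch under stated assumptions, not this class's implementation: the function name `accept_draft_tokens` is hypothetical, it operates on plain Python lists, and it ignores the sampling inputs (seed, temperature, top_k, top_p) that the real graph accepts.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Greedy verification for one speculative step (illustrative only).

    draft_tokens: k tokens proposed by the draft model.
    target_tokens: k + 1 tokens chosen by the target model at the same
        positions, plus one bonus position after the last draft token.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the bonus target token is kept too.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

With this rule each step yields at least one token (the target's own pick) and at most k + 1 tokens when the whole draft block is accepted.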
input_types()
input_types(kv_params, draft_kv_params)
Input types mirror Eagle3MHAKimiK25Unified.input_types.
Order:
tokens, device_offsets, host_offsets, return_n_logits, data_parallel_splits, signal_buffers, target_kv_cache (flat), batch_context_lengths, target_ep_inputs, draft_tokens, draft_kv_blocks (one per device), seed, temperature, top_k, max_k, top_p, min_top_p.
Parameters:
- kv_params (KVCacheParamInterface)
- draft_kv_params (KVCacheParams)
Return type:
tuple[TensorType | BufferType, ...]
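Because some groups in the order above expand to several entries (for example, draft_kv_blocks has one entry per device), a group's position in the flat tuple depends on the widths of the groups before it. The following sketch shows one way to compute those positions. The names `INPUT_GROUPS` and `group_slices` are hypothetical, and the group widths passed in are assumptions; the source does not state how many entries each group contributes.

```python
# Group names in the documented order (illustrative constant, not part of MAX).
INPUT_GROUPS = (
    "tokens", "device_offsets", "host_offsets", "return_n_logits",
    "data_parallel_splits", "signal_buffers", "target_kv_cache",
    "batch_context_lengths", "target_ep_inputs", "draft_tokens",
    "draft_kv_blocks", "seed", "temperature", "top_k", "max_k",
    "top_p", "min_top_p",
)


def group_slices(widths=None):
    """Map each group name to its slice in a flat input tuple.

    widths: optional dict of group name -> number of entries.
        Groups not listed are assumed to contribute one entry.
    """
    widths = widths or {}
    slices = {}
    start = 0
    for name in INPUT_GROUPS:
        width = widths.get(name, 1)
        slices[name] = slice(start, start + width)
        start += width
    return slices
```

For a hypothetical two-device setup, `group_slices({"signal_buffers": 2, "draft_kv_blocks": 2})` places draft_kv_blocks at `slice(11, 13)`.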
UnifiedDflashKimiK25Config
class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Config(*, target, draft, speculative_config, target_layer_ids=<factory>, mask_token_id=0, block_size=0)
Bases: ArchConfigWithKVCache
Unified config for the DFlash Kimi K2.5 pipeline.
Holds the Kimi target (DeepseekV3Config populated from a
KimiK25ForConditionalGeneration HF config) and the DFlash draft
(DFlashKimiK25DraftConfig built from the draft HF config).
Parameters:
- target (DeepseekV3Config)
- draft (DFlashKimiK25DraftConfig)
- speculative_config (SpeculativeConfig)
- target_layer_ids (list[int])
- mask_token_id (int)
- block_size (int)
block_size
block_size: int = 0
draft
draft: DFlashKimiK25DraftConfig
get_kv_params()
get_kv_params()
KV cache parameters to use when running the model.
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the default maximum sequence length for the model.
Subclasses should determine whether this value can be overridden by
setting the --max-length (pipeline_config.model.max_length) flag.
Return type:
initialize()
classmethod initialize(pipeline_config, model_config=None)
Build an early placeholder config for KV memory estimation.
The DFlash-specific fields are populated in
UnifiedDflashKimiK25Model.load_model() once the draft HF config
has been parsed; we then re-instantiate the config with the real
values.
Parameters:
- pipeline_config (PipelineConfig)
- model_config (MAXModelConfig | None)
Return type:
mask_token_id
mask_token_id: int = 0
resolve_block_size()
resolve_block_size(*, default=None)
speculative_config
speculative_config: SpeculativeConfig
target
target: DeepseekV3Config
target_layer_ids
validate_dflash_fields()
validate_dflash_fields()
Runs strict validation. Called from
UnifiedDflashKimiK25Model.load_model() once the DFlash-specific
fields have been populated. __post_init__ accepts the empty
placeholder config produced by initialize(), so these checks can't
be enforced there.
Return type:
None
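The placeholder-then-validate lifecycle described for initialize() and validate_dflash_fields() can be sketched with a plain dataclass. This is an illustration only: `SketchDflashConfig` is a hypothetical stand-in that copies three defaults from the constructor signature above, and the specific checks shown (non-empty target_layer_ids, positive block_size) are assumptions, since the source does not list what the real method enforces.

```python
from dataclasses import dataclass, field, replace


@dataclass
class SketchDflashConfig:
    # Defaults mirror the documented signature; everything else is omitted.
    target_layer_ids: list[int] = field(default_factory=list)
    mask_token_id: int = 0
    block_size: int = 0

    def __post_init__(self):
        # Lenient on purpose: the empty placeholder must construct cleanly.
        if self.block_size < 0:
            raise ValueError("block_size must not be negative")

    def validate_dflash_fields(self) -> None:
        # Strict checks (assumed), run only after real values are known.
        if not self.target_layer_ids:
            raise ValueError("target_layer_ids must not be empty")
        if self.block_size <= 0:
            raise ValueError("block_size must be positive")


# Early placeholder, e.g. for memory estimation before the draft config exists.
placeholder = SketchDflashConfig()

# Later: re-instantiate with real values, then validate strictly.
real = replace(placeholder, target_layer_ids=[1, 5, 9], block_size=8)
real.validate_dflash_fields()
```

Keeping `__post_init__` permissive and moving strict checks into a separate method lets the same type serve both as an early placeholder and as the final, fully populated config.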
UnifiedDflashKimiK25Inputs
class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, image_token_indices=None, precomputed_image_embeddings=None, pixel_values=None, grid_thws=None, cu_seqlens=None, max_seqlen=None, vision_position_ids=None, language_image_embeddings=<factory>, language_image_token_indices=<factory>, draft_tokens=None, draft_kv_blocks=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, token_bitmasks=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())
Bases: KimiK2_5ModelInputs
Inputs for the unified DFlash Kimi K2.5 graph.
Same as KimiK2_5ModelInputs plus DFlash draft buffers. The
draft owns its own MHA KVCacheInputs so its dispatch
metadata is independent of the target's MLA cache.
Parameters:
- tokens (Buffer)
- input_row_offsets (Buffer)
- signal_buffers (list[Buffer])
- host_input_row_offsets (Buffer)
- batch_context_lengths (list[Buffer])
- image_token_indices (list[Buffer] | None)
- precomputed_image_embeddings (list[Buffer] | None)
- pixel_values (list[Buffer] | None)
- grid_thws (list[Buffer] | None)
- cu_seqlens (list[Buffer] | None)
- max_seqlen (list[Buffer] | None)
- vision_position_ids (list[Buffer] | None)
- language_image_embeddings (list[Buffer])
- language_image_token_indices (list[Buffer])
- draft_tokens (Buffer | None)
- draft_kv_blocks (list[Buffer] | None)
- seed (Buffer | None)
- temperature (Buffer | None)
- top_k (Buffer | None)
- max_k (Buffer | None)
- top_p (Buffer | None)
- min_top_p (Buffer | None)
- in_thinking_phase (Buffer | None)
- token_bitmasks (Buffer | None)
- kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
- lora_ids (Buffer | None)
- lora_ranks (Buffer | None)
- hidden_states (Buffer | list[Buffer] | None)
- return_n_logits (Buffer)
- data_parallel_splits (Buffer)
- ep_inputs (tuple[Buffer, ...])
buffers
Returns the language model input ABI tuple.
draft_kv_blocks
draft_tokens
in_thinking_phase
max_k
min_top_p
seed
temperature
token_bitmasks
top_k
top_p
UnifiedDflashKimiK25Model
class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Model(*args, **kwargs)
Bases: KimiK2_5Model
Unified DFlash Kimi K2.5 pipeline model.
Pipelines are routed here when the target HF architecture is
KimiK25ForConditionalGeneration and
SpeculativeConfig.is_dflash() is true.
execute()
execute(model_inputs)
Executes the graph with the given inputs.
Parameters:
model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.
Returns:
ModelOutputs containing the pipeline's output tensors.
Return type:
This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
load_model()
load_model(session)
Load the model with the given weights.
Parameters:
session (InferenceSession)
Return type:
prepare_initial_token_inputs()
prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, draft_tokens=None, draft_kv_cache_buffers=None, **kwargs)
Prepares the initial inputs to be passed to execute().
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and kv_cache_inputs (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
prepare_next_token_inputs()
prepare_next_token_inputs(next_tokens, prev_model_inputs)
Prepares the secondary inputs to be passed to execute().
While prepare_initial_token_inputs manages the initial inputs, this
function updates the inputs for each step in a multi-step execution pattern.
Parameters:
- next_tokens (Buffer)
- prev_model_inputs (ModelInputs)
Return type:
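The multi-step pattern formed by prepare_initial_token_inputs(), execute(), and prepare_next_token_inputs() can be sketched with a toy stand-in. Everything here is hypothetical: `ToyModel`, `ToyInputs`, and `generate` are not part of MAX, the method signatures are reduced to plain Python values rather than Buffer and ModelInputs objects, and the "model" just increments the last token.

```python
from dataclasses import dataclass


@dataclass
class ToyInputs:
    tokens: list[int]
    step: int = 0


class ToyModel:
    """Stand-in with the same three-method shape, simplified signatures."""

    def prepare_initial_token_inputs(self, prompt):
        # Build the first-step inputs from the full prompt.
        return ToyInputs(tokens=list(prompt))

    def execute(self, inputs):
        # Fake "next token": last input token plus one.
        return inputs.tokens[-1] + 1

    def prepare_next_token_inputs(self, next_token, prev_inputs):
        # Later steps only feed the newly produced token.
        return ToyInputs(tokens=[next_token], step=prev_inputs.step + 1)


def generate(model, prompt, num_steps):
    inputs = model.prepare_initial_token_inputs(prompt)
    produced = []
    for _ in range(num_steps):
        next_token = model.execute(inputs)
        produced.append(next_token)
        inputs = model.prepare_next_token_inputs(next_token, inputs)
    return produced
```

The initial call runs once per request, while the next-token call runs after every execute() and receives the previous inputs so it can carry state forward.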