Python module

max.pipelines.architectures.unified_dflash_kimi_k25

DFlash speculative decoding for Kimi K2.5 with unified graph compilation.

UnifiedDflashKimiK25

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25(config)

Bases: Module

Fuses the following stages into one module: merge -> target (MLA) -> reject -> materialize -> draft block.

Parameters:

config (UnifiedDflashKimiK25Config)
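The "reject" stage is where speculative decoding decides which draft tokens survive. This page does not document how that stage is implemented, so the following is only a minimal sketch of the general idea using greedy verification. The function name `accept_draft_tokens` and its list-based inputs are hypothetical; the real stage runs inside the compiled graph and takes sampling inputs (seed, temperature, top_k, top_p) that this sketch ignores.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Greedy verification for a single sequence (illustrative only).

    draft_tokens: the k tokens proposed by the draft model.
    target_tokens: the k + 1 tokens the target model picks at the same
        positions; the extra one follows the last draft token.
    Returns the accepted draft prefix plus one token chosen by the target.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one more position than the draft")

    n_accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first disagreement: reject this and all later drafts
        n_accepted += 1

    # The target's own token at the first mismatch (or at the bonus position
    # when every draft token matched) is always emitted, so each step makes
    # progress by at least one token.
    return list(draft_tokens[:n_accepted]) + [target_tokens[n_accepted]]
```

With this rule, a fully accepted block of k draft tokens yields k + 1 output tokens from one target pass, which is where the speed-up of speculative decoding comes from.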

input_types()

input_types(kv_params, draft_kv_params)

Input types mirror Eagle3MHAKimiK25Unified.input_types.

Order:
tokens, device_offsets, host_offsets, return_n_logits, data_parallel_splits, signal_buffers, target_kv_cache (flat), batch_context_lengths, target_ep_inputs, draft_tokens, draft_kv_blocks (one per device), seed, temperature, top_k, max_k, top_p, min_top_p.

Return type:

tuple[TensorType | BufferType, …]
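Because the graph inputs are positional, it can help to keep the documented order in one place when debugging an argument mismatch. The sketch below simply records the order listed above; the names `INPUT_ORDER` and `group_index` are hypothetical and not part of the library. It treats each name as one group, although a group such as draft_kv_blocks (one per device) may expand to several entries in the actual tuple.

```python
# Documented input groups, in order. Groups may expand to more than one
# tensor or buffer in the real tuple (for example, one draft KV block
# per device), so these are group positions, not tuple indices.
INPUT_ORDER = (
    "tokens",
    "device_offsets",
    "host_offsets",
    "return_n_logits",
    "data_parallel_splits",
    "signal_buffers",
    "target_kv_cache",
    "batch_context_lengths",
    "target_ep_inputs",
    "draft_tokens",
    "draft_kv_blocks",
    "seed",
    "temperature",
    "top_k",
    "max_k",
    "top_p",
    "min_top_p",
)


def group_index(name):
    """Return the position of an input group in the documented order."""
    return INPUT_ORDER.index(name)
```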

UnifiedDflashKimiK25Config

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Config(*, target, draft, speculative_config, target_layer_ids=<factory>, mask_token_id=0, block_size=0)

Bases: ArchConfigWithKVCache

Unified config for the DFlash Kimi K2.5 pipeline.

Holds the Kimi target (DeepseekV3Config populated from a KimiK25ForConditionalGeneration HF config) and the DFlash draft (DFlashKimiK25DraftConfig built from the draft HF config).

block_size

block_size: int = 0

draft

draft: DFlashKimiK25DraftConfig

get_kv_params()

get_kv_params()

KV cache parameters to use when running the model.

Return type:

KVCacheParamInterface

get_max_seq_len()

get_max_seq_len()

Returns the default maximum sequence length for the model.

Subclasses should determine whether this value can be overridden by setting the --max-length (pipeline_config.model.max_length) flag.

Return type:

int

initialize()

classmethod initialize(pipeline_config, model_config=None)

Build an early placeholder config for KV memory estimation.

The DFlash-specific fields are populated in UnifiedDflashKimiK25Model.load_model() once the draft HF config has been parsed; we then re-instantiate the config with the real values.

Return type:

Self
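The two-phase pattern described for initialize() (build a placeholder first, then re-instantiate with real values) can be sketched with a plain dataclass. Everything here is an assumption made for illustration: `PlaceholderSketch` is a hypothetical stand-in that carries only the three DFlash-specific fields, and the dictionary of draft values is invented rather than taken from any real HF config.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class PlaceholderSketch:
    """Hypothetical stand-in holding only the DFlash-specific fields."""

    target_layer_ids: list[int] = field(default_factory=list)
    mask_token_id: int = 0
    block_size: int = 0


# Phase 1: an early placeholder built from defaults, good enough for
# sizing decisions that do not depend on the draft model.
placeholder = PlaceholderSketch()

# Phase 2: once the draft config has been parsed, build a new instance
# with the real values. These numbers are made up for the example.
parsed_draft_fields = {
    "target_layer_ids": [1, 2, 3],
    "mask_token_id": 7,
    "block_size": 16,
}
real_config = replace(placeholder, **parsed_draft_fields)
```

Creating a new instance instead of mutating the placeholder keeps the early config immutable, so any code that captured it during memory estimation never sees half-populated state.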

mask_token_id

mask_token_id: int = 0

resolve_block_size()

resolve_block_size(*, default=None)

Parameters:

default (int | None)

Return type:

int

speculative_config

speculative_config: SpeculativeConfig

target

target: DeepseekV3Config

target_layer_ids

target_layer_ids: list[int]

validate_dflash_fields()

validate_dflash_fields()

Runs strict validation of the DFlash-specific fields. Called from UnifiedDflashKimiK25Model.load_model() once those fields have been populated. __post_init__ must accept the empty placeholder config produced by initialize(), so these checks cannot be enforced there.

Return type:

None
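The split between a lenient __post_init__ and a strict, explicitly called validator can be illustrated as follows. This is a sketch under stated assumptions: `DeferredCheckConfig` and `validate_fields` are hypothetical names, and the specific rules (non-empty layer list, positive block size) are guesses chosen for the example, since this page does not list the checks the real method performs.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredCheckConfig:
    """Hypothetical config that tolerates an empty placeholder state."""

    target_layer_ids: list[int] = field(default_factory=list)
    block_size: int = 0

    def __post_init__(self):
        # Only invariants that also hold for the placeholder belong here.
        if self.block_size < 0:
            raise ValueError("block_size must not be negative")

    def validate_fields(self):
        # Strict checks, run once the real values are known.
        # These rules are assumptions made for the sketch.
        if not self.target_layer_ids:
            raise ValueError("target_layer_ids must not be empty")
        if self.block_size <= 0:
            raise ValueError("block_size must be positive")
```

Constructing `DeferredCheckConfig()` succeeds, while calling `validate_fields()` on that placeholder raises, which mirrors why the strict checks have to wait until after model loading.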

UnifiedDflashKimiK25Inputs

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, image_token_indices=None, precomputed_image_embeddings=None, pixel_values=None, grid_thws=None, cu_seqlens=None, max_seqlen=None, vision_position_ids=None, language_image_embeddings=<factory>, language_image_token_indices=<factory>, draft_tokens=None, draft_kv_blocks=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, token_bitmasks=None, *, kv_cache_inputs=None, lora_ids=None, lora_ranks=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=())

Bases: KimiK2_5ModelInputs

Inputs for the unified DFlash Kimi K2.5 graph.

Same as KimiK2_5ModelInputs plus DFlash draft buffers. The draft owns its own MHA KVCacheInputs so its dispatch metadata is independent of the target’s MLA cache.

buffers

property buffers: tuple[Buffer, ...]

Returns the language model input ABI tuple.

draft_kv_blocks

draft_kv_blocks: list[Buffer] | None = None

draft_tokens

draft_tokens: Buffer | None = None

in_thinking_phase

in_thinking_phase: Buffer | None = None

max_k

max_k: Buffer | None = None

min_top_p

min_top_p: Buffer | None = None

seed

seed: Buffer | None = None

temperature

temperature: Buffer | None = None

token_bitmasks

token_bitmasks: Buffer | None = None

top_k

top_k: Buffer | None = None

top_p

top_p: Buffer | None = None

UnifiedDflashKimiK25Model

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Model(*args, **kwargs)

Bases: KimiK2_5Model

Unified DFlash Kimi K2.5 pipeline model.

Requests are routed to this model when the target HF architecture is KimiK25ForConditionalGeneration and SpeculativeConfig.is_dflash() returns true.

execute()

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

UnifiedEagleOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.

load_model()

load_model(session)

Load the model with the given weights.

Parameters:

session (InferenceSession)

Return type:

tuple[Model, Model]

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, draft_tokens=None, draft_kv_cache_buffers=None, **kwargs)

Prepares the initial inputs to be passed to execute().

The inputs and functionality can vary per model. For example, model inputs could include encoded tensors, unique IDs per tensor when using a KV cache manager, and kv_cache_inputs (or None if the model does not use KV cache). This method typically batches encoded tensors, claims a KV cache slot if needed, and returns the inputs and caches.

Return type:

UnifiedDflashKimiK25Inputs

prepare_next_token_inputs()

prepare_next_token_inputs(next_tokens, prev_model_inputs)

Prepares the secondary inputs to be passed to execute().

prepare_initial_token_inputs manages the initial inputs, while this function updates the inputs for each subsequent step in a multi-step execution pattern.

Return type:

KimiK2_5ModelInputs
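The relationship between prepare_initial_token_inputs(), execute(), and prepare_next_token_inputs() in a multi-step pattern can be sketched with a stub. `StubModel` and `run_steps` are hypothetical and exist only to show the call order; the stub uses plain dictionaries and integers where the real model uses input dataclasses and buffers, and it omits the KV cache and draft arguments entirely.

```python
class StubModel:
    """Hypothetical stand-in that mirrors only the three method names."""

    def prepare_initial_token_inputs(self, replica_batches):
        # Flatten the batches into one token list for the first step.
        tokens = [tok for batch in replica_batches for tok in batch]
        return {"tokens": tokens, "step": 0}

    def execute(self, model_inputs):
        # Pretend the next token is always the last token plus one.
        return model_inputs["tokens"][-1] + 1

    def prepare_next_token_inputs(self, next_tokens, prev_model_inputs):
        # Later steps only carry the newly produced token forward.
        return {"tokens": [next_tokens], "step": prev_model_inputs["step"] + 1}


def run_steps(model, replica_batches, num_steps):
    """Run the initial step, then update the inputs for each later step."""
    inputs = model.prepare_initial_token_inputs(replica_batches)
    generated = []
    for _ in range(num_steps):
        next_token = model.execute(inputs)
        generated.append(next_token)
        inputs = model.prepare_next_token_inputs(next_token, inputs)
    return generated
```

The point of the split is that the expensive batching work happens once, and each following step only derives its inputs from the previous step's inputs and outputs.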