Python module

max.pipelines.architectures.unified_dflash_kimi_k25

DFlash speculative decoding for Kimi K2.5 with unified graph compilation.

`UnifiedDflashKimiK25`

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25(config)

Bases: Module

Fused module that chains the stages merge -> target (MLA) -> reject -> materialize -> draft block.

Parameters: config (UnifiedDflashKimiK25Config)
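
In speculative decoding generally, a reject stage compares the tokens proposed by the draft with the target's own predictions and keeps the longest agreeing prefix. The following is a minimal pure-Python sketch of that generic greedy acceptance rule. It is an assumption for illustration, not the DFlash implementation: `accept_draft_tokens` and its arguments are hypothetical names, and the real graph operates on device tensors.

```python
def accept_draft_tokens(
    draft_tokens: list[int], target_tokens: list[int]
) -> list[int]:
    """Generic greedy verification (illustrative; not the MAX implementation).

    target_tokens[i] is the target's prediction at the position of
    draft_tokens[i]. It has one extra trailing entry: the prediction
    that follows the full draft block.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")
    accepted: list[int] = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First disagreement: keep the target's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(draft)
    # Every draft token matched, so the target's trailing token is kept too.
    accepted.append(target_tokens[-1])
    return accepted
```

With this rule, every verification step yields at least one token, because the target's prediction replaces the first rejected draft token.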

`input_types()`

input_types(kv_params)

Input types mirror Eagle3MHAKimiK25Unified.input_types.

kv_params is the unified {"target", "draft"} tree: the target leaf is MLA and the draft leaf is MHA, and each leaf carries its own blocks and dispatch metadata. The input types describe the distributed (DP + signals + EP) MHA-draft graph, which has no vision, in-thinking-phase, or structured-output inputs. See build_spec_decode_input_types() for the canonical ordering.

Parameters: kv_params (MultiKVCacheParams)
Return type: tuple[TensorType | BufferType, …]
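
The two-leaf shape of kv_params can be pictured with a small stand-in. This is a hedged sketch only: `LeafKVParams`, `leaf_attention_kinds`, and their fields are hypothetical and do not reflect the real MultiKVCacheParams API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafKVParams:
    """Hypothetical stand-in for one leaf of the kv_params tree."""

    attention: str  # "mla" for the target leaf, "mha" for the draft leaf
    num_blocks: int  # illustrative: each leaf carries its own blocks


def leaf_attention_kinds(tree: dict[str, LeafKVParams]) -> dict[str, str]:
    """Maps each leaf name to its attention kind after checking the keys."""
    if set(tree) != {"target", "draft"}:
        raise ValueError("expected exactly the 'target' and 'draft' leaves")
    return {name: leaf.attention for name, leaf in tree.items()}


# Illustrative values only.
kv_tree = {
    "target": LeafKVParams(attention="mla", num_blocks=128),
    "draft": LeafKVParams(attention="mha", num_blocks=32),
}
```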

`UnifiedDflashKimiK25Config`

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Config(*, target, draft, speculative_config, target_layer_ids=<factory>, mask_token_id=0, block_size=0)

Bases: ArchConfigWithKVCache

Unified config for the DFlash Kimi K2.5 pipeline.

Holds the Kimi target (DeepseekV3Config populated from a KimiK25ForConditionalGeneration HF config) and the DFlash draft (DFlashKimiK25DraftConfig built from the draft HF config).

Parameters:

target (DeepseekV3Config)
draft (DFlashKimiK25DraftConfig)
speculative_config (SpeculativeConfig)
target_layer_ids (list[int])
mask_token_id (int)
block_size (int)

`block_size`

block_size: int = 0

`devices`

property devices: list[DeviceRef]

Exposes the target’s devices so that this unified config satisfies the ModelConfigWithKVCache protocol that KimiK25MemoryPlanner requires. Target and draft share placement: __post_init__ checks the device count, and both are built from the target’s devices.

`draft`

draft: DFlashKimiK25DraftConfig

`get_kv_params()`

get_kv_params()

KV cache parameters to use when running the model.

Return type: KVCacheParamInterface

`get_max_seq_len()`

get_max_seq_len()

Returns the default maximum sequence length for the model.

Subclasses should determine whether this value can be overridden by setting the --max-length (pipeline_config.model.max_length) flag.

Return type: int

`initialize()`

classmethod initialize(pipeline_config, model_config=None)

Build an early placeholder config for KV memory estimation.

The DFlash-specific fields are populated in UnifiedDflashKimiK25Model.load_model() once the draft HF config has been parsed; we then re-instantiate the config with the real values.

Parameters:

pipeline_config (PipelineConfig)
model_config (MAXModelConfig | None)

Return type:

Self

`mask_token_id`

mask_token_id: int = 0

`resolve_block_size()`

resolve_block_size(*, default=None)

Parameters: default (int | None)
Return type: int

`speculative_config`

speculative_config: SpeculativeConfig

`target`

target: DeepseekV3Config

`target_layer_ids`

target_layer_ids: list[int]

`validate_dflash_fields()`

validate_dflash_fields()

Performs strict validation. It is run from UnifiedDflashKimiK25Model.load_model() once the DFlash-specific fields have been populated. __post_init__ must accept the empty placeholder config produced by initialize(), so these checks can’t be enforced there.

Return type: None
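
The placeholder-then-validate flow described above can be sketched with a plain dataclass. This is a minimal illustration under stated assumptions: `SketchConfig` is a hypothetical stand-in that borrows only the documented field names and defaults, and the specific checks in its `validate_dflash_fields()` are guesses, not the library's actual rules.

```python
from dataclasses import dataclass, field, replace


@dataclass
class SketchConfig:
    """Hypothetical stand-in; field names mirror the documented defaults."""

    target_layer_ids: list[int] = field(default_factory=list)
    mask_token_id: int = 0
    block_size: int = 0

    def validate_dflash_fields(self) -> None:
        # Strict checks live here rather than in __post_init__, so the
        # defaults-only placeholder stays constructible. The checks
        # themselves are assumptions made for this sketch.
        if not self.target_layer_ids:
            raise ValueError("target_layer_ids must be populated")
        if self.block_size <= 0:
            raise ValueError("block_size must be positive")


# Phase 1: an early placeholder built from defaults only.
placeholder = SketchConfig()

# Phase 2: re-instantiate with real values, then validate strictly.
populated = replace(
    placeholder, target_layer_ids=[1, 2], mask_token_id=7, block_size=8
)
populated.validate_dflash_fields()
```

Calling `validate_dflash_fields()` on `placeholder` raises `ValueError` in this sketch, which is why the strict checks cannot run at construction time.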

`UnifiedDflashKimiK25Inputs`

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, image_token_indices=None, precomputed_image_embeddings=None, pixel_values=None, grid_thws=None, cu_seqlens=None, max_seqlen=None, vision_position_ids=None, language_image_embeddings=<factory>, language_image_token_indices=<factory>, eplb_counter_buffers=<factory>, token_bitmasks=None, *, kv_cache_inputs=None, lora=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=(), draft_tokens=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, pinned_bitmask=None, wait_payload=None, device_bitmask_scratch=None, structured_output=False)

Bases: UnifiedSpecDecodeInputs, KimiK2_5ModelInputs

Inputs for the unified DFlash Kimi K2.5 graph.

Same as KimiK2_5ModelInputs plus the spec-decode fields and trailing buffer packing from UnifiedSpecDecodeInputs. The draft owns its own MHA KVCacheInputs so its dispatch metadata is independent of the target’s MLA cache. The DFlash graph does not bind in_thinking_phase.

Parameters:

tokens (Buffer)
input_row_offsets (Buffer)
signal_buffers (list[Buffer])
host_input_row_offsets (Buffer)
batch_context_lengths (list[Buffer])
image_token_indices (list[Buffer] | None)
precomputed_image_embeddings (list[Buffer] | None)
pixel_values (list[Buffer] | None)
grid_thws (list[Buffer] | None)
cu_seqlens (list[Buffer] | None)
max_seqlen (list[Buffer] | None)
vision_position_ids (list[Buffer] | None)
language_image_embeddings (list[Buffer])
language_image_token_indices (list[Buffer])
eplb_counter_buffers (list[Buffer])
token_bitmasks (Buffer | None)
kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer] | None)
lora (LoRAInputs | None)
hidden_states (Buffer | list[Buffer] | None)
return_n_logits (Buffer)
data_parallel_splits (Buffer)
ep_inputs (tuple[Buffer, ...])
draft_tokens (Buffer | None)
seed (Buffer | None)
temperature (Buffer | None)
top_k (Buffer | None)
max_k (Buffer | None)
top_p (Buffer | None)
min_top_p (Buffer | None)
in_thinking_phase (Buffer | None)
pinned_bitmask (Buffer | None)
wait_payload (Buffer | None)
device_bitmask_scratch (Buffer | None)
structured_output (bool)

`buffers`

property buffers: tuple[Buffer, ...]

Returns positional Buffer inputs for model ABI calls.
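
One way such a property could flatten named fields into a positional tuple is sketched below. Everything here is an assumption for illustration: `SketchInputs` is hypothetical, `bytes` stands in for Buffer, and the field order and the skipping of unset fields are not confirmed by the source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchInputs:
    """Hypothetical stand-in; bytes is used in place of Buffer."""

    tokens: bytes
    input_row_offsets: bytes
    draft_tokens: bytes | None = None
    seed: bytes | None = None

    @property
    def buffers(self) -> tuple[bytes, ...]:
        # Fixed positional order; unset optional fields are skipped
        # (an assumption made for this sketch).
        ordered = (
            self.tokens,
            self.input_row_offsets,
            self.draft_tokens,
            self.seed,
        )
        return tuple(b for b in ordered if b is not None)
```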

`token_bitmasks`

token_bitmasks: Buffer | None = None

`UnifiedDflashKimiK25Model`

class max.pipelines.architectures.unified_dflash_kimi_k25.UnifiedDflashKimiK25Model(*args, **kwargs)

Bases: _UnifiedSpecDecodeModelMixin, KimiK2_5Model

Unified DFlash Kimi K2.5 pipeline model.

Routed here when the target HF architecture is KimiK25ForConditionalGeneration and SpeculativeConfig.is_dflash() returns true.

Parameters:

args (Any)
kwargs (Any)

`batch_processor_cls`

batch_processor_cls

alias of UnifiedDflashKimiK25BatchProcessor

`load_model()`

load_model(session)

Load the model with the given weights.

Parameters: session (InferenceSession)
Return type: tuple[Model, Model]

`prepare_initial_token_inputs()`

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1, draft_tokens=None, **kwargs)

Delegates to the batch processor; typed for Eagle subclasses.

Parameters:

replica_batches (Sequence[Sequence[KimiK2_5TextAndVisionContext]])
kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer] | None)
return_n_logits (int)
draft_tokens (Buffer | None)
kwargs (Any)

Return type:

UnifiedDflashKimiK25Inputs
