Python module

max.pipelines.architectures.dflash_llama3

DFlash draft model for Llama3-family targets.

The draft is a Qwen3-style transformer (per-head Q/K RMSNorm, non-causal attention). On each iteration it fuses the target model's concatenated hidden states into its KV cache via AttentionWithRope.materialize_kv_from_hidden(), then runs a single non-causal block forward over [verified_id, MASK, MASK, …].
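
To make the shape of one draft iteration concrete, here is a minimal NumPy sketch. Everything below (the sizes, weights, projection, and attention math) is illustrative stand-in code under assumed shapes, not the real MAX implementation, which runs as a compiled graph over TensorValues with a paged KV cache.

import numpy as np

D = 64     # draft hidden size (illustrative)
CTX = 8    # target hidden states fused into the KV cache (illustrative)
SPEC = 4   # draft slots per iteration: [verified_id, MASK, MASK, MASK]

rng = np.random.default_rng(0)
w_proj = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)  # stand-in for project_target_hidden
w_q = rng.standard_normal((D, D)) / np.sqrt(D)
w_k = rng.standard_normal((D, D)) / np.sqrt(D)
w_v = rng.standard_normal((D, D)) / np.sqrt(D)

# 1) materialize_kv analogue: fold concatenated target hidden states
#    (assumed here to be two layers' worth, hence 2 * D) into keys and
#    values for the draft's cache.
target_hs_concat = rng.standard_normal((CTX, 2 * D))
ctx_hidden = target_hs_concat @ w_proj
k_cache = ctx_hidden @ w_k
v_cache = ctx_hidden @ w_v

# 2) forward_block analogue: one non-causal attention pass over the row
#    [verified_id, MASK, MASK, ...]; every query attends to every cached key.
input_embeds = rng.standard_normal((SPEC, D))   # embeddings of [verified_id, MASK, ...]
scores = input_embeds @ w_q @ k_cache.T / np.sqrt(D)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax with no causal mask
draft_hidden = weights @ v_cache                # one hidden state per speculative slot
print(draft_hidden.shape)                       # (4, 64)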

DFlashLlama3

class max.pipelines.architectures.dflash_llama3.DFlashLlama3(config, *, num_context_features)

Bases: Module

DFlash draft transformer for a Llama3 target.

Parameters:

config

num_context_features (keyword-only)

forward_block()

forward_block(input_embeds, kv_collection, input_row_offsets)

Parameters:

input_embeds

kv_collection

input_row_offsets

Return type:

TensorValue

materialize_kv()

materialize_kv(ctx_hidden, input_row_offsets, kv_collection)

Parameters:

ctx_hidden

input_row_offsets

kv_collection

Return type:

None

project_target_hidden()

project_target_hidden(target_hs_concat)

Parameters:

target_hs_concat (TensorValue)

Return type:

TensorValue
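
Taken together, the three methods above make up one draft step. The following sketch shows the call order implied by the module docstring and the signatures and return types documented here; the model and argument objects are hypothetical placeholders (inside MAX these are TensorValues and a KV collection within a compiled graph).

def draft_iteration(model, target_hs_concat, input_embeds,
                    input_row_offsets, kv_collection):
    # 1) Project the concatenated target hidden states into the draft's
    #    hidden space (TensorValue -> TensorValue).
    ctx_hidden = model.project_target_hidden(target_hs_concat)
    # 2) Write K/V for those context positions into the cache; returns
    #    None and mutates kv_collection in place.
    model.materialize_kv(ctx_hidden, input_row_offsets, kv_collection)
    # 3) Run the single non-causal block over the [verified_id, MASK, ...]
    #    embeddings and return the resulting hidden states (TensorValue).
    return model.forward_block(input_embeds, kv_collection, input_row_offsets)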

DFlashLlama3Model

class max.pipelines.architectures.dflash_llama3.DFlashLlama3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: LlamaModelBase

Placeholder pipeline model for the DFlash draft architecture.

See the module docstring. execute() raises because the draft is only ever run via the unified pipeline.

Parameters:

pipeline_config

session

devices

kv_cache_config

weights

adapter (default: None)

return_logits (default: ReturnLogits.LAST_TOKEN)

return_hidden_states (default: ReturnHiddenStates.NONE)

execute()

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

This is an abstract method that concrete PipelineModels must implement to define their execution logic. DFlashLlama3Model implements it only to raise: the draft never executes standalone and is instead driven by the unified pipeline (see the class docstring).
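
Because this class is a placeholder, calling execute() directly is expected to fail. A hedged illustration, assuming hypothetical model and model_inputs instances and catching broadly since the exact exception type is not documented here:

try:
    outputs = model.execute(model_inputs)  # hypothetical instances
except Exception as err:
    # The draft is only ever run via the unified pipeline, so direct
    # execution raises (see the class docstring above).
    print(f"not directly executable: {err}")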