IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Python module

max.pipelines.architectures.dflash_llama3

DFlash draft model for Llama3-family targets.

The draft is a Qwen3-style transformer with per-head Q/K RMSNorm and non-causal attention. It fuses the concatenated target hidden states into its KV cache via AttentionWithRope.materialize_kv_from_hidden(). On each iteration it then runs a single non-causal forward pass over the block [verified_id, MASK, MASK, …].
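The block layout and the difference between non-causal and causal attention can be sketched in plain Python. This is only an illustration under stated assumptions, not code from the MAX library: MASK_ID, build_draft_block, non_causal_mask, and causal_mask are hypothetical names, and the real mask token id and block size depend on the model.

```python
MASK_ID = -1  # hypothetical placeholder; the real mask token id is not given here


def build_draft_block(verified_id: int, block_size: int) -> list[int]:
    """Return [verified_id, MASK, MASK, ...] with block_size entries."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [verified_id] + [MASK_ID] * (block_size - 1)


def non_causal_mask(block_size: int) -> list[list[bool]]:
    """Every position in the block may attend to every other position."""
    return [[True] * block_size for _ in range(block_size)]


def causal_mask(block_size: int) -> list[list[bool]]:
    """For comparison: position `row` may attend only to columns <= row."""
    return [[col <= row for col in range(block_size)] for row in range(block_size)]


block = build_draft_block(verified_id=1234, block_size=4)
full = non_causal_mask(len(block))
causal = causal_mask(len(block))
```

With a non-causal mask, the MASK positions can attend to one another as well as to the verified token, which is what lets a single forward pass handle the whole block.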

DFlashLlama3

class max.pipelines.architectures.dflash_llama3.DFlashLlama3(config, *, num_context_features)

Bases: Module

DFlash draft transformer for a Llama3 target.

Parameters:

forward_block()

forward_block(input_embeds, kv_collection, input_row_offsets)

Parameters:

Return type:

TensorValue
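Both forward_block() and materialize_kv() take an input_row_offsets argument. The sketch below assumes this follows the common ragged-batch convention: sequences are packed back to back along one token axis, entry i is the start of sequence i, and the last entry is the total token count. The helper names row_offsets and split_ragged are hypothetical and exist only for this illustration.

```python
from itertools import accumulate


def row_offsets(seq_lens: list[int]) -> list[int]:
    """Prefix sums of the per-sequence lengths, starting at 0."""
    return [0, *accumulate(seq_lens)]


def split_ragged(flat: list[int], offsets: list[int]) -> list[list[int]]:
    """Recover the individual sequences from the packed token list."""
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]


# Three sequences of lengths 4, 2, and 3 packed into one flat list.
offsets = row_offsets([4, 2, 3])
packed = list(range(9))
sequences = split_ragged(packed, offsets)
```

Under this assumption, a batch of B sequences is described by B + 1 offsets and needs no padding.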

materialize_kv()

materialize_kv(ctx_hidden, input_row_offsets, kv_collection)

Parameters:

Return type:

None
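The None return type suggests that materialize_kv() works by side effect on the KV cache. A rough NumPy sketch of that idea follows. It is not the MAX implementation: the weight matrices, the dict-based cache, and the function names are hypothetical, the RMSNorm has no learned scale, and rotary position embedding is left out for brevity.

```python
import numpy as np


def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis, without a learned scale."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def materialize_kv_sketch(ctx_hidden, w_k, w_v, num_kv_heads, cache) -> None:
    """Project context rows to K and V and append them to `cache` in place."""
    n_tokens = ctx_hidden.shape[0]
    k = (ctx_hidden @ w_k).reshape(n_tokens, num_kv_heads, -1)
    v = (ctx_hidden @ w_v).reshape(n_tokens, num_kv_heads, -1)
    cache["k"].append(rms_norm(k))  # per-head K normalization
    cache["v"].append(v)


# Deterministic toy inputs: 5 context tokens, hidden size 8, 2 KV heads of size 2.
ctx_hidden = np.arange(40, dtype=np.float64).reshape(5, 8) + 1.0
w_k = np.ones((8, 4))
w_v = np.ones((8, 4)) * 0.5
cache = {"k": [], "v": []}
result = materialize_kv_sketch(ctx_hidden, w_k, w_v, num_kv_heads=2, cache=cache)
```

Because the context keys and values are written once, the later block forward only has to compute queries, keys, and values for the short draft block itself.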

project_target_hidden()

project_target_hidden(target_hs_concat)

Parameters:

target_hs_concat (TensorValue)

Return type:

TensorValue
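As a rough picture of what a projection over concatenated target hidden states could look like, the NumPy sketch below joins per-layer states along the feature axis and applies one linear map. It assumes a single bias-free projection; the actual layer may differ. The names project_target_hidden_sketch and w_proj and all shapes are hypothetical.

```python
import numpy as np


def project_target_hidden_sketch(target_hs_concat: np.ndarray, w_proj: np.ndarray) -> np.ndarray:
    """Map [tokens, n_layers * target_dim] to [tokens, draft_dim]."""
    return target_hs_concat @ w_proj


# Hidden states from 3 target layers, 4 tokens each, target hidden size 6.
layer_states = [np.full((4, 6), float(i)) for i in range(3)]
target_hs_concat = np.concatenate(layer_states, axis=-1)  # shape (4, 18)

# Hypothetical projection down to a draft hidden size of 5.
w_proj = np.ones((18, 5))
ctx_hidden = project_target_hidden_sketch(target_hs_concat, w_proj)
```

In this sketch the output has one row per context token at the draft hidden size, a form that could then be passed on as ctx_hidden.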

DFlashLlama3Model

class max.pipelines.architectures.dflash_llama3.DFlashLlama3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: LlamaModelBase

Placeholder pipeline model for the DFlash draft architecture.

See the module description above. Calling execute() raises, because the draft model only ever runs through the unified pipeline.

Parameters:

execute()

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.
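The DFlashLlama3Model description says that execute raises. A placeholder of that kind can be sketched as follows. The class name, the exception type, and the message are assumptions made for illustration; the page does not state which exception is actually raised.

```python
class PlaceholderDraftModel:
    """Hypothetical stand-in: a model that exists but is never executed directly."""

    def execute(self, model_inputs):
        # Assumed exception type; the real one is not documented here.
        raise NotImplementedError(
            "The draft model runs only through the unified pipeline."
        )


model = PlaceholderDraftModel()
try:
    model.execute(model_inputs=None)
    raised = False
except NotImplementedError as exc:
    raised = True
    message = str(exc)
```

Raising on a direct call is a simple way to make misuse fail early while the class still satisfies the interface of its base.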