Python module
max.pipelines.architectures.unified_dflash_llama3
DFlash speculative decoding for Llama3 with unified graph compilation.
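As background for the classes below, here is a minimal sketch of the verification step in generic greedy speculative decoding: the draft proposes several tokens, the target scores them in one pass, and the longest agreeing prefix is committed. This is an illustration under stated assumptions, not DFlash's actual acceptance logic; the function name `accept_draft_tokens` is hypothetical and is not part of `max.pipelines`.

```python
def accept_draft_tokens(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens to commit after one greedy verification step.

    target_tokens[i] is the target model's greedy choice at the position of
    draft_tokens[i]; target_tokens carries one extra trailing entry for the
    position that follows the last draft token.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    accepted: list[int] = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First mismatch: commit the target's token instead and stop.
            accepted.append(target)
            return accepted
        accepted.append(draft)

    # Every draft token matched, so the target's extra token is committed too.
    accepted.append(target_tokens[-1])
    return accepted
```

Each step commits at least one token, so decoding always makes progress even when the draft is wrong at the first position.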
DflashDraftHFConfig
class max.pipelines.architectures.unified_dflash_llama3.DflashDraftHFConfig(mask_token_id: 'int', target_layer_ids: 'list[int]', block_size: 'int | None' = None, num_target_layers: 'int | None' = None)
Bases: object
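-
Parameters:
-
- mask_token_id (int)
- target_layer_ids (list[int])
- block_size (int | None)
- num_target_layers (int | None)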
block_size
mask_token_id
mask_token_id: int
num_target_layers
target_layer_ids
PersistentInputBuffers
class max.pipelines.architectures.unified_dflash_llama3.PersistentInputBuffers(tokens, input_row_offsets)
Bases: object
Pinned-host buffers reused across unified spec-decode batch steps.
alloc()
classmethod alloc(max_batch_size, max_batch_input_tokens, device)
Allocates persistent token and row-offset buffers for spec-decode batching.
-
Parameters:
-
Return type:
input_row_offsets
input_row_offsets: Buffer
tokens
tokens: Buffer
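The reuse pattern behind persistent buffers can be sketched in plain Python: allocate once at the worst-case size, then overwrite the same storage on every batch step. This is only an illustration. `HostBuffers` and its `fill` method are hypothetical stand-ins that use lists instead of pinned-host `Buffer` objects and omit the `device` argument, and the cumulative row-offset layout (one more offset than rows) is an assumption about how ragged batches are packed.

```python
from dataclasses import dataclass


@dataclass
class HostBuffers:
    """Hypothetical stand-in for PersistentInputBuffers using plain lists."""

    tokens: list[int]
    input_row_offsets: list[int]

    @classmethod
    def alloc(cls, max_batch_size: int, max_batch_input_tokens: int) -> "HostBuffers":
        # Allocate once at the worst-case size; N rows need N + 1 offsets.
        return cls(
            tokens=[0] * max_batch_input_tokens,
            input_row_offsets=[0] * (max_batch_size + 1),
        )

    def fill(self, batch: list[list[int]]) -> tuple[int, int]:
        """Write a ragged batch in place; return (num_tokens, num_offsets) used."""
        total = sum(len(row) for row in batch)
        if len(batch) + 1 > len(self.input_row_offsets) or total > len(self.tokens):
            raise ValueError("batch exceeds the preallocated capacity")
        cursor = 0
        for i, row in enumerate(batch):
            self.input_row_offsets[i] = cursor
            # Same-length slice assignment overwrites without reallocating.
            self.tokens[cursor : cursor + len(row)] = row
            cursor += len(row)
        self.input_row_offsets[len(batch)] = cursor
        return cursor, len(batch) + 1
```

A caller would invoke `alloc` once at startup and `fill` on each step, reading only the leading `num_tokens` and `num_offsets` entries; anything past those counts is stale data from earlier steps.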
UnifiedDflashLlama3Config
class max.pipelines.architectures.unified_dflash_llama3.UnifiedDflashLlama3Config(*, target: 'Llama3Config', draft: 'Llama3Config', speculative_config: 'SpeculativeConfig', target_layer_ids: 'list[int]' = <factory>, mask_token_id: 'int' = 0, block_size: 'int' = 0)
Bases: ArchConfigWithKVCache
-
Parameters:
-
- target (Llama3Config)
- draft (Llama3Config)
- speculative_config (SpeculativeConfig)
- target_layer_ids (list[int])
- mask_token_id (int)
- block_size (int)
block_size
block_size: int = 0
draft
draft: Llama3Config
get_kv_params()
get_kv_params()
KV cache parameters to use when running the model.
-
Return type:
get_max_seq_len()
get_max_seq_len()
Returns the default maximum sequence length for the model.
Subclasses should determine whether this value can be overridden by
setting the --max-length (pipeline_config.model.max_length) flag.
-
Return type:
initialize()
classmethod initialize(pipeline_config, model_config=None)
Initialize the config from a PipelineConfig.
-
Parameters:
-
- pipeline_config (PipelineConfig): The pipeline configuration.
- model_config (MAXModelConfig | None): The model configuration to read from. When
None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.
-
Return type:
mask_token_id
mask_token_id: int = 0
resolve_block_size()
resolve_block_size(*, default=None)
speculative_config
speculative_config: SpeculativeConfig
target
target: Llama3Config
target_layer_ids
validate_dflash_fields()
validate_dflash_fields()
Strict validation run from UnifiedDflashLlama3Model.load_model
once the DFlash-specific fields have been populated from the draft
HF config. __post_init__ accepts the empty-placeholder config
produced by initialize(), so these checks cannot be enforced there.
-
Return type:
-
None
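The two-phase pattern described here (lenient construction, strict validation later) can be sketched with a plain dataclass. `DflashFieldsSketch` is a hypothetical stand-in that mirrors only the three DFlash-specific fields and their defaults from the signature above; the specific rules it checks (non-empty `target_layer_ids`, positive `block_size`) are assumptions for illustration, not the library's documented checks.

```python
from dataclasses import dataclass, field


@dataclass
class DflashFieldsSketch:
    """Hypothetical stand-in holding only the DFlash-specific fields."""

    target_layer_ids: list[int] = field(default_factory=list)
    mask_token_id: int = 0
    block_size: int = 0

    def __post_init__(self) -> None:
        # Lenient on purpose: the placeholder defaults must construct cleanly.
        if self.block_size < 0:
            raise ValueError("block_size must not be negative")

    def validate_dflash_fields(self) -> None:
        # Strict checks (assumed rules), run only after the fields are populated.
        if not self.target_layer_ids:
            raise ValueError("target_layer_ids must not be empty")
        if self.block_size <= 0:
            raise ValueError("block_size must be positive")
```

Splitting the checks this way lets a placeholder instance exist between construction and model loading, while still failing loudly if the fields are never filled in.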
UnifiedDflashLlama3Inputs
class max.pipelines.architectures.unified_dflash_llama3.UnifiedDflashLlama3Inputs(tokens, input_row_offsets, return_n_logits, *, kv_cache_inputs=None, lora=None, hidden_states=None, draft_tokens=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, pinned_bitmask=None, wait_payload=None, device_bitmask_scratch=None, structured_output=False)
Bases: UnifiedSpecDecodeInputs
Inputs for the unified DFlash Llama3 graph.
The spec-decode fields and trailing buffer packing come from
UnifiedSpecDecodeInputs; tokens / input_row_offsets /
return_n_logits plus the KV cache form this single-device graph's
prefix. The DFlash graph does not bind in_thinking_phase.
-
Parameters:
-
- tokens (Buffer)
- input_row_offsets (Buffer)
- return_n_logits (Buffer)
- kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer] | None)
- lora (LoRAInputs | None)
- hidden_states (Buffer | list[Buffer] | None)
- draft_tokens (Buffer | None)
- seed (Buffer | None)
- temperature (Buffer | None)
- top_k (Buffer | None)
- max_k (Buffer | None)
- top_p (Buffer | None)
- min_top_p (Buffer | None)
- in_thinking_phase (Buffer | None)
- pinned_bitmask (Buffer | None)
- wait_payload (Buffer | None)
- device_bitmask_scratch (Buffer | None)
- structured_output (bool)
buffers
Returns positional Buffer inputs for model ABI calls.
input_row_offsets
input_row_offsets: Buffer
return_n_logits
return_n_logits: Buffer
tokens
tokens: Buffer
UnifiedDflashLlama3Model
class max.pipelines.architectures.unified_dflash_llama3.UnifiedDflashLlama3Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, max_batch_size=1)
Bases: _UnifiedSpecDecodeModelMixin, PipelineModelWithKVCache[TextContext]
Unified DFlash Llama3: target + draft in one compiled graph.
-
Parameters:
-
- pipeline_config (PipelineConfig)
- session (InferenceSession)
- devices (list[Device])
- kv_cache_config (KVCacheConfig)
- weights (Weights)
- adapter (WeightsAdapter | None)
- return_logits (ReturnLogits)
- return_hidden_states (ReturnHiddenStates)
- max_batch_size (int)
batch_processor_cls
batch_processor_cls
alias of UnifiedDflashLlama3BatchProcessor
get_kv_params()
classmethod get_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)
Returns the KV cache params for the pipeline model.
Delegates to model_config_cls.construct_kv_params(...).
Subclasses with custom KV behavior should override this method.
-
Parameters:
-
- huggingface_config (PreTrainedConfig)
- pipeline_config (PipelineConfig)
- devices (list[DeviceRef])
- kv_cache_config (KVCacheConfig)
- cache_dtype (DType)
-
Return type:
load_model()
load_model(session)
-
Parameters:
-
session (InferenceSession)
-
Return type:
model
model: Model
model_config_cls
model_config_cls
alias of UnifiedDflashLlama3Config