
Python module

max.pipelines.architectures.unified_mtp_gemma4

Gemma4 with a multi-token prediction (MTP) draft model for speculative decoding, compiled as a single unified graph.

`UnifiedMTPGemma4Inputs`

class max.pipelines.architectures.unified_mtp_gemma4.UnifiedMTPGemma4Inputs(tokens, input_row_offsets, host_input_row_offsets, return_n_logits, data_parallel_splits, signal_buffers, batch_context_lengths, images=None, video=None, combined_embeds=None, combined_indices=None, *, kv_cache_inputs=None, lora=None, hidden_states=None, draft_tokens=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, pinned_bitmask=None, wait_payload=None, device_bitmask_scratch=None, structured_output=False)

source

Bases: UnifiedSpecDecodeInputs

Inputs for the UnifiedMTPGemma4 model.

The spec-decode fields and trailing buffer packing come from UnifiedSpecDecodeInputs; the fields below plus the KV cache form this distributed MTP graph’s prefix. The graph binds the per-row in_thinking_phase flag and, when structured output is enabled, the constrained-decoding bitmask triple.

Parameters:

tokens (Buffer)
input_row_offsets (Buffer)
host_input_row_offsets (Buffer)
return_n_logits (Buffer)
data_parallel_splits (Buffer)
signal_buffers (list[Buffer])
batch_context_lengths (list[Buffer])
images (ImageInputs | None)
video (VideoInputs | None)
combined_embeds (list[Buffer] | None)
combined_indices (list[Buffer] | None)
kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer] | None)
lora (LoRAInputs | None)
hidden_states (Buffer | list[Buffer] | None)
draft_tokens (Buffer | None)
seed (Buffer | None)
temperature (Buffer | None)
top_k (Buffer | None)
max_k (Buffer | None)
top_p (Buffer | None)
min_top_p (Buffer | None)
in_thinking_phase (Buffer | None)
pinned_bitmask (Buffer | None)
wait_payload (Buffer | None)
device_bitmask_scratch (Buffer | None)
structured_output (bool)

`batch_context_lengths`

batch_context_lengths: list[Buffer]

source

`buffers`

property buffers: tuple[Buffer, ...]

source

Returns positional Buffer inputs for model ABI calls.
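As a rough illustration of what "positional Buffer inputs" can mean, the sketch below flattens a small inputs container into an ordered tuple. Everything here is hypothetical: `SketchInputs`, the `bytes` stand-in for `Buffer`, and the ordering of the optional fields are assumptions for illustration, not the actual layout used by `UnifiedMTPGemma4Inputs`.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in type for illustration only; this is NOT max's Buffer class.
Buffer = bytes


@dataclass
class SketchInputs:
    """Hypothetical container; not the real UnifiedMTPGemma4Inputs."""

    tokens: Buffer
    input_row_offsets: Buffer
    in_thinking_phase: Optional[Buffer] = None
    pinned_bitmask: Optional[Buffer] = None
    wait_payload: Optional[Buffer] = None
    device_bitmask_scratch: Optional[Buffer] = None
    structured_output: bool = False

    @property
    def buffers(self) -> tuple:
        # Required inputs come first, in a fixed positional order.
        out = [self.tokens, self.input_row_offsets]
        # Assumed ordering: the per-row thinking flag follows the required inputs.
        if self.in_thinking_phase is not None:
            out.append(self.in_thinking_phase)
        # The bitmask triple is only bound when structured output is enabled.
        if self.structured_output:
            triple = (
                self.pinned_bitmask,
                self.wait_payload,
                self.device_bitmask_scratch,
            )
            if any(b is None for b in triple):
                raise ValueError("structured output needs all three bitmask buffers")
            out.extend(triple)
        return tuple(out)
```

The point of the pattern is that a compiled graph takes its inputs by position, so the container, not the caller, owns the ordering.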

`combined_embeds`

combined_embeds: list[Buffer] | None = None

source

`combined_indices`

combined_indices: list[Buffer] | None = None

source

`data_parallel_splits`

data_parallel_splits: Buffer

source

`host_input_row_offsets`

host_input_row_offsets: Buffer

source

`images`

images: ImageInputs | None = None

source

`input_row_offsets`

input_row_offsets: Buffer

source

`return_n_logits`

return_n_logits: Buffer

source

`signal_buffers`

signal_buffers: list[Buffer]

source

`tokens`

tokens: Buffer

source

`video`

video: VideoInputs | None = None

source

`UnifiedMTPGemma4Model`

class max.pipelines.architectures.unified_mtp_gemma4.UnifiedMTPGemma4Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, max_batch_size=1)

source

Bases: _UnifiedSpecDecodeModelMixin, AlwaysSignalBuffersMixin, MultiGraphPipelineModelWithKVCache[Gemma4Context]

Gemma4 with MTP: the merge, target, rejection, and shift stages are compiled into one graph.

Parameters:

pipeline_config (PipelineConfig)
session (InferenceSession)
devices (list[Device])
kv_cache_config (KVCacheConfig)
weights (Weights)
adapter (WeightsAdapter | None)
return_logits (ReturnLogits)
return_hidden_states (ReturnHiddenStates)
max_batch_size (int)

`batch_processor_cls`

batch_processor_cls

source

alias of UnifiedMTPGemma4BatchProcessor

`calculate_max_seq_len()`

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

source

Calculates the optimal max sequence length for the model.

Default implementation delegates to model_config_cls. Override when pipeline-model semantics differ from the config (for example, bounding max_length where the config is permissive).

Parameters:

pipeline_config (PipelineConfig) – Configuration for the pipeline.
huggingface_config (AutoConfig) – Hugging Face model configuration.

Returns:

The maximum sequence length to use.

Return type:

int
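The delegate-then-override pattern described above can be sketched as follows. All classes here (`FakePipelineConfig`, `FakeModelConfig`, `BasePipelineModel`, `BoundedPipelineModel`), the `HARD_CAP` value, and the use of a plain dict for the Hugging Face config are hypothetical stand-ins; the real classes and their fields may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePipelineConfig:
    """Hypothetical stand-in for PipelineConfig; carries only a user limit."""

    max_length: Optional[int] = None


class FakeModelConfig:
    """Hypothetical stand-in for a model config class."""

    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        limit = huggingface_config["max_position_embeddings"]
        if pipeline_config.max_length is None:
            return limit
        return min(pipeline_config.max_length, limit)


class BasePipelineModel:
    model_config_cls = FakeModelConfig

    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        # Default behavior: delegate to the config class.
        return cls.model_config_cls.calculate_max_seq_len(
            pipeline_config, huggingface_config
        )


class BoundedPipelineModel(BasePipelineModel):
    HARD_CAP = 8192  # Hypothetical bound chosen for this example.

    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        # Override: the config is permissive, so clamp its answer.
        delegated = super().calculate_max_seq_len(pipeline_config, huggingface_config)
        return min(delegated, cls.HARD_CAP)
```

Because the method is a classmethod, it can be called before any model instance (and any compiled graph) exists.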

`execute()`

execute(model_inputs)

source

Executes the model and returns all three graph outputs for speculative decoding.

Before running the unified graph, this method runs the vision encoder (prefill only) and binds the projected soft-token embeddings and their scatter indices. Images appear only during prefill, when draft_tokens has shape [batch, 0]. Decode steps replay the captured unified graph with the empty defaults, so the pre-pass is a no-op there.

Parameters:

model_inputs (ModelInputs)

Return type:

UnifiedEagleOutputs
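One way to picture the embeddings-plus-scatter-indices binding is the plain-Python sketch below. It is an interpretation, not the graph's actual op: `merge_embeddings` is a hypothetical helper, and embeddings are modeled as lists of float rows rather than device buffers.

```python
def merge_embeddings(text_embeds, image_embeds, indices):
    """Return a copy of text_embeds with image rows written at the given indices.

    Hypothetical helper for illustration; rows are plain lists of floats.
    """
    if len(image_embeds) != len(indices):
        raise ValueError("need exactly one scatter index per image embedding row")
    merged = [row[:] for row in text_embeds]
    for row, idx in zip(image_embeds, indices):
        # Overwrite the placeholder token's embedding with the vision row.
        merged[idx] = row[:]
    return merged


# Prefill: rows 1 and 2 are image placeholder positions.
text = [[0.0, 0.0], [9.0, 9.0], [9.0, 9.0], [1.0, 1.0]]
image = [[0.5, 0.6], [0.7, 0.8]]
prefill = merge_embeddings(text, image, [1, 2])

# Decode: empty embeddings and indices leave the text rows untouched.
decode = merge_embeddings(text, [], [])
```

With empty inputs the loop body never runs, which mirrors why the pre-pass costs nothing on decode steps.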

`model`

model: Model

source

The compiled unified MTP graph (target + draft + rejection). This is the graph exposed for device graph capture / replay.
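For readers new to speculative decoding, the rejection step can be sketched in its simplest greedy form: keep the longest prefix of draft tokens that the target model agrees with, then take one token from the target. This is a generic textbook variant written in plain Python, with `verify_greedy` as a hypothetical name; the compiled graph may instead use stochastic rejection sampling driven by the sampling inputs (temperature, top_k, top_p), which this sketch does not model.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Greedy draft verification.

    draft_tokens: k tokens proposed by the draft model.
    target_tokens: k + 1 tokens, where target_tokens[i] is the target model's
        choice at position i given the draft prefix before it.
    Returns the accepted draft prefix plus one token from the target.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one more position than the draft")
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # First disagreement: reject this and all later draft tokens.
        accepted.append(draft)
    # On a mismatch this is the target's correction; if every draft token was
    # accepted, it is the extra "bonus" token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


partial = verify_greedy([5, 7, 9], [5, 7, 2, 4])
full = verify_greedy([5, 7, 9], [5, 7, 9, 4])
prefill = verify_greedy([], [3])
```

Each call always yields at least one new token, so a step with zero accepted draft tokens still makes progress.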

`model_config_cls`

model_config_cls

source

alias of Gemma4ForConditionalGenerationConfig

`release()`

release(request_id)

source

Release vision encoder cache for a completed request.

Parameters:

request_id (RequestID)

Return type:

None
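A per-request cache with an explicit release hook might look like the sketch below. `VisionCacheSketch` and its methods are hypothetical and only illustrate the lifecycle; the real cache's structure and contents are not documented here.

```python
class VisionCacheSketch:
    """Hypothetical per-request cache of vision encoder outputs."""

    def __init__(self):
        self._cache = {}

    def get_or_encode(self, request_id, encode):
        # Run the (expensive) encoder at most once per request.
        if request_id not in self._cache:
            self._cache[request_id] = encode()
        return self._cache[request_id]

    def release(self, request_id) -> None:
        # Safe to call for unknown or already released requests.
        self._cache.pop(request_id, None)

    def __len__(self):
        return len(self._cache)


calls = []


def fake_encode():
    calls.append(1)
    return [0.1, 0.2]


cache = VisionCacheSketch()
first = cache.get_or_encode("req-1", fake_encode)
second = cache.get_or_encode("req-1", fake_encode)  # Served from the cache.
cache.release("req-1")
cache.release("req-1")  # Releasing twice is a no-op.
```

Releasing on request completion keeps cached encoder outputs from accumulating across a long-running server process.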

`vision_model`

vision_model: Model | None

source

The compiled vision encoder graph, or None for text-only checkpoints. Runs eagerly during prefill (outside the captured graph).
