
Python module

max.pipelines.architectures.diffusion_gemma

DiffusionGemma block-diffusion architecture.

DiffusionGemmaForBlockDiffusionConfig

class max.pipelines.architectures.diffusion_gemma.DiffusionGemmaForBlockDiffusionConfig(*, devices, dtype, unquantized_dtype=bfloat16, kv_params, image_token_index, video_token_index=262144, text_config, vision_config, tie_word_embeddings=False, canvas_length=256, boi_token_id=255999, eoi_token_id=258882)

Bases: Gemma4ForConditionalGenerationConfig

Top-level MAX config for DiffusionGemma block diffusion.

Extends the Gemma4 multimodal config with the block-diffusion canvas geometry. KV cache parameters, per-layer-type attention geometry, MoE sizing, and the vision tower config are all inherited from the donor config (Gemma4ForConditionalGenerationConfig), which reads them through the view installed by initialize_from_config.

boi_token_id

boi_token_id: int = 255999

Begin-of-image token id wrapping image prompts.

canvas_length

canvas_length: int = 256

Number of tokens denoised per block-diffusion canvas.
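As a rough illustration of what the canvas length implies for generation, the sketch below splits a requested number of new tokens into canvas-sized position ranges. It assumes that canvases are laid out back to back after the committed prompt and that the last canvas is simply shortened; this page does not say whether the real pipeline truncates or pads the final canvas. The helper names canvases_needed and canvas_spans are hypothetical and are not part of the MAX API.

```python
import math
from typing import Iterator, Tuple

CANVAS_LENGTH = 256  # default canvas_length documented above


def canvases_needed(num_new_tokens: int, canvas_length: int = CANVAS_LENGTH) -> int:
    """Number of canvases required to cover num_new_tokens (hypothetical helper)."""
    if num_new_tokens < 0:
        raise ValueError("num_new_tokens must be non-negative")
    return math.ceil(num_new_tokens / canvas_length)


def canvas_spans(
    committed_length: int,
    num_new_tokens: int,
    canvas_length: int = CANVAS_LENGTH,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token positions per canvas, end exclusive.

    Assumption: canvases are contiguous and the last one is shortened
    rather than padded.
    """
    start = committed_length
    stop = committed_length + num_new_tokens
    while start < stop:
        end = min(start + canvas_length, stop)
        yield (start, end)
        start = end
```

For example, with a 10-token prompt and 600 requested tokens, list(canvas_spans(10, 600)) produces three ranges, the last one shorter than 256 positions.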

eoi_token_id

eoi_token_id: int = 258882

End-of-image token id wrapping image prompts.

finalize()

finalize(huggingface_config, state_dict, return_logits)

Finalizes the fields that depend on state_dict.

Parses quantization config from the weights and finalizes the text sub-config.

Parameters:

  • huggingface_config (AutoConfig) – HuggingFace model configuration.
  • state_dict (dict[str, WeightData]) – Model weights dictionary.
  • return_logits (ReturnLogits) – Return logits configuration.

Return type:

None

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config)

Initializes from pipeline and HuggingFace configs.

Fields that depend on the state_dict should be set via finalize().

Parameters:

  • pipeline_config (PipelineConfig) – The MAX Engine pipeline configuration.
  • huggingface_config (AutoConfig) – Top-level HuggingFace model configuration.

Returns:

A config instance ready for finalization.

Return type:

Self
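The two-phase pattern described here (build the config from the pipeline and HuggingFace configs first, then fill in weight-dependent fields with finalize()) can be sketched with a stand-in class. This is a minimal sketch under stated assumptions, not the MAX implementation: SketchConfig, the plain-dict configs, the quant_encoding field, and the ".weight_scale" key check are all hypothetical and only illustrate the call order.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a two-phase config object."""

    canvas_length: int = 256
    boi_token_id: int = 255999
    eoi_token_id: int = 258882
    # Hypothetical field: only knowable once the weights are available.
    quant_encoding: Optional[str] = None
    finalized: bool = False

    @classmethod
    def initialize_from_config(
        cls, pipeline_config: Dict[str, Any], huggingface_config: Dict[str, Any]
    ) -> "SketchConfig":
        # Phase 1: only fields derivable from the configs are set here.
        return cls(canvas_length=huggingface_config.get("canvas_length", 256))

    def finalize(
        self,
        huggingface_config: Dict[str, Any],
        state_dict: Dict[str, Any],
        return_logits: str,
    ) -> None:
        # Phase 2: fields that depend on the weights.
        # The ".weight_scale" suffix is an invented marker for this sketch.
        if any(name.endswith(".weight_scale") for name in state_dict):
            self.quant_encoding = "quantized"
        self.finalized = True


# Usage: phase 1 needs no weights; phase 2 runs once they are loaded.
hf_config = {"canvas_length": 128}
config = SketchConfig.initialize_from_config({}, hf_config)
config.finalize(hf_config, {"layers.0.weight_scale": 1.0}, "last_token")
```

Splitting construction this way lets the config be created before the checkpoint is read, which is why the documented finalize() returns None and mutates the instance.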

DiffusionGemmaForBlockDiffusionModel

class max.pipelines.architectures.diffusion_gemma.DiffusionGemmaForBlockDiffusionModel(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN)

Bases: Gemma3_MultiModalModel

Compiles the vision/encoder/decoder graphs for block diffusion.

Differences from the donor pipeline model:

  • load_model uses this port’s weight converters (decoder-canonical checkpoint layout) and compiles an extra decoder graph.
  • execute_decoder_step runs one denoise step. Canvas K/V are written into the cache slots that follow each request’s committed length, so calling it repeatedly without advancing the context lengths overwrites the same slots. From the encoder’s perspective, the cache is read-only.
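The overwrite behavior of the decoder step can be pictured with a toy cache held in a Python list. This is only a sketch of the slot arithmetic: the real cache is a paged device buffer managed by MAX, and write_canvas is a hypothetical helper, not an API of this module.

```python
from typing import List, Optional


def write_canvas(
    cache: List[Optional[str]], committed_length: int, canvas_kv: List[str]
) -> None:
    """Write canvas entries into the slots right after the committed prefix."""
    end = committed_length + len(canvas_kv)
    if end > len(cache):
        raise ValueError("canvas does not fit in the cache")
    cache[committed_length:end] = canvas_kv


# Three committed prompt entries, five free slots.
cache: List[Optional[str]] = ["p0", "p1", "p2", None, None, None, None, None]
committed = 3

# Two denoise steps without advancing the committed length:
# the second step overwrites the slots written by the first.
write_canvas(cache, committed, ["a0", "a1"])
write_canvas(cache, committed, ["b0", "b1"])
after_two_steps = list(cache)

# Only after the canvas is committed does the next canvas move forward.
committed += 2
write_canvas(cache, committed, ["c0", "c1"])
```

In this toy model the committed prefix (p0 to p2) is never touched by a denoise step, which matches the read-only view described above.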

decoder_model

decoder_model: Model

execute_decoder_step()

execute_decoder_step(canvas_tokens, input_row_offsets, sc_logits, sc_enabled, temperature, kv_cache_inputs)

Runs one denoise step.

Returns (sc_logits_out, argmax, topk_probs, topk_idx, entropy). sc_logits_out is device-resident bf16 and feeds the next step’s sc_logits. The [N, 64] top-k pair (topk_probs, topk_idx) feeds host-side categorical sampling.

Return type:

tuple[Buffer, Buffer, Buffer, Buffer, Buffer]
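Host-side categorical sampling from a top-k pair could look like the sketch below. It assumes the two outputs have already been copied to the host as nested Python lists of shape [N, K], and it lets random.choices normalize each row's weights, since this page does not say whether topk_probs is renormalized over the K entries. sample_from_topk is a hypothetical helper, not part of this module.

```python
import random
from typing import List, Sequence


def sample_from_topk(
    topk_probs: Sequence[Sequence[float]],
    topk_idx: Sequence[Sequence[int]],
    rng: random.Random,
) -> List[int]:
    """Draw one token id per canvas position from its top-k distribution."""
    sampled = []
    for probs, ids in zip(topk_probs, topk_idx):
        if len(probs) != len(ids):
            raise ValueError("probabilities and indices must have the same length")
        if sum(probs) <= 0:
            raise ValueError("each row needs positive total probability")
        # random.choices treats weights as relative, so rows need not sum to 1.
        sampled.append(rng.choices(ids, weights=probs, k=1)[0])
    return sampled


# Usage with N=2 positions and K=3 candidates (the real step uses K=64).
rng = random.Random(0)
tokens = sample_from_topk(
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
    [[11, 42, 7], [3, 99, 58]],
    rng,
)
```

Passing an explicit random.Random instance keeps the sampling reproducible across runs without touching the global random state.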

load_model()

load_model(session)

Loads the compiled Gemma3 MultiModal models into the MAX Engine session.

Parameters:

  • session (InferenceSession)

Returns:

A tuple of (vision_model, language_model).

Return type:

tuple[Model, Model]