Python module
max.pipelines.architectures.unified_mtp_glm5_2
GLM-5.2 (DeepSeek-V3.2 sparse) MTP draft model for unified speculative decoding.
UnifiedMTPGlm5_2Inputs
class max.pipelines.architectures.unified_mtp_glm5_2.UnifiedMTPGlm5_2Inputs(tokens, input_row_offsets, signal_buffers, host_input_row_offsets, batch_context_lengths, *, kv_cache_inputs=None, lora=None, hidden_states=None, return_n_logits, data_parallel_splits, ep_inputs=(), draft_tokens=None, seed=None, temperature=None, top_k=None, max_k=None, top_p=None, min_top_p=None, in_thinking_phase=None, pinned_bitmask=None, wait_payload=None, device_bitmask_scratch=None, structured_output=False)
Bases: UnifiedSpecDecodeInputs, DeepseekV3Inputs
Inputs for the UnifiedMTPGlm5_2 model.
Target-prefix fields come from DeepseekV3Inputs; the spec-decode fields and trailing buffer packing come from UnifiedSpecDecodeInputs. The MTP graph binds the per-row in_thinking_phase flag (consumed by relaxed acceptance).
Parameters:
- tokens (Buffer)
- input_row_offsets (Buffer)
- signal_buffers (list[Buffer])
- host_input_row_offsets (Buffer)
- batch_context_lengths (list[Buffer])
- kv_cache_inputs (KVCacheInputsInterface[Buffer, Buffer] | None)
- lora (LoRAInputs | None)
- hidden_states (Buffer | list[Buffer] | None)
- return_n_logits (Buffer)
- data_parallel_splits (Buffer)
- ep_inputs (tuple[Buffer, ...])
- draft_tokens (Buffer | None)
- seed (Buffer | None)
- temperature (Buffer | None)
- top_k (Buffer | None)
- max_k (Buffer | None)
- top_p (Buffer | None)
- min_top_p (Buffer | None)
- in_thinking_phase (Buffer | None)
- pinned_bitmask (Buffer | None)
- wait_payload (Buffer | None)
- device_bitmask_scratch (Buffer | None)
- structured_output (bool)
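In the signature above, every parameter after the bare `*` is keyword-only, and `return_n_logits` and `data_parallel_splits` have no default, so they must always be passed by name. The sketch below illustrates that calling convention with a hypothetical stand-in function, `toy_inputs`, that copies only part of the signature's shape; it is not the MAX class, and it uses plain Python lists where the real constructor expects `Buffer` objects.

```python
def toy_inputs(tokens, input_row_offsets, *, kv_cache_inputs=None,
               return_n_logits, data_parallel_splits, draft_tokens=None):
    """Hypothetical stand-in mirroring part of the documented signature."""
    return {
        "tokens": tokens,
        "input_row_offsets": input_row_offsets,
        "kv_cache_inputs": kv_cache_inputs,
        "return_n_logits": return_n_logits,
        "data_parallel_splits": data_parallel_splits,
        "draft_tokens": draft_tokens,
    }


# Required keyword-only arguments are passed by name; optional ones default to None.
ok = toy_inputs([1, 2, 3], [0, 3], return_n_logits=1, data_parallel_splits=[0, 1])

# Passing keyword-only arguments positionally raises TypeError.
try:
    toy_inputs([1, 2, 3], [0, 3], 1, [0, 1])
    positional_rejected = False
except TypeError:
    positional_rejected = True

# Omitting a required keyword-only argument also raises TypeError.
try:
    toy_inputs([1, 2, 3], [0, 3], return_n_logits=1)
    missing_rejected = False
except TypeError:
    missing_rejected = True
```

The same rule applies to the optional sampling and spec-decode fields (`seed`, `temperature`, `top_k`, and so on): they can be omitted, but when supplied they must be named.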
buffers
Returns positional Buffer inputs for model ABI calls.
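As a rough illustration of what a positional `buffers` view can look like, the sketch below flattens named fields into a tuple. Everything in it is an assumption made for illustration: `ToyInputs` is a hypothetical class, the field subset is abridged, and both the ordering (required prefix fields first, optional fields trailing) and the skipping of unset fields are guesses, not the documented behavior of `UnifiedMTPGlm5_2Inputs.buffers`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyInputs:
    """Hypothetical stand-in; not the MAX class."""

    tokens: list
    input_row_offsets: list
    draft_tokens: Optional[list] = None
    seed: Optional[list] = None

    @property
    def buffers(self) -> tuple:
        # Assumption: required prefix fields come first, in declaration order.
        prefix = (self.tokens, self.input_row_offsets)
        # Assumption: optional trailing fields are packed only when set.
        trailing = tuple(
            b for b in (self.draft_tokens, self.seed) if b is not None
        )
        return prefix + trailing


plain = ToyInputs(tokens=[1, 2], input_row_offsets=[0, 2])
with_draft = ToyInputs(tokens=[1, 2], input_row_offsets=[0, 2], draft_tokens=[7])
```

A positional tuple like this lets a compiled graph be called with `model(*inputs.buffers)` style unpacking, which is why the order of the fields matters more than their names at the call boundary.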
UnifiedMTPGlm5_2Model
class max.pipelines.architectures.unified_mtp_glm5_2.UnifiedMTPGlm5_2Model(*args, **kwargs)
Bases: _UnifiedSpecDecodeModelMixin, Glm5_1Model
GLM-5.2 with MTP: merge + V3.2 target + rejection + sparse draft.
batch_processor_cls
batch_processor_cls
alias of UnifiedMTPGlm5_2BatchProcessor
load_model()
load_model(session)
Load the model with the given weights.
Parameters:
- session (InferenceSession)
Return type: