Python module

max.pipelines.architectures.glm5_1

GLM-5.1 (GlmMoeDsa) mixture-of-experts architecture for text generation.

`Glm5_1Config`

class max.pipelines.architectures.glm5_1.Glm5_1Config(*, dtype, kv_params, devices, use_subgraphs=True, data_parallel_degree=1, vocab_size=129280, hidden_size=7168, intermediate_size=18432, moe_intermediate_size=2048, moe_layer_freq=1, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_shared_experts=1, n_routed_experts=256, routed_scaling_factor=2.5, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, v_head_dim=128, qk_nope_head_dim=128, topk_method='greedy', n_group=8, topk_group=4, num_experts_per_tok=8, first_k_dense_replace=3, norm_topk_prob=True, hidden_act='silu', max_position_embeddings=4096, max_seq_len=163840, rms_norm_eps=1e-06, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, rope_interleave=True, scoring_func='sigmoid', attention_bias=False, attention_dropout=0.0, norm_dtype=bfloat16, gate_dtype=None, correction_bias_dtype=None, max_batch_context_length=131072, quant_config=None, dense_mlp_layers_without_quant=frozenset({}), ep_config=None, graph_mode='auto', return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE, eagle_aux_hidden_state_layer_ids=None, eplb_profile_enabled=False, index_head_dim=128, index_n_heads=64, index_topk=2048, indexer_types=<factory>)

Bases: DeepseekV3_2Config

Configuration for GLM-5.1 models.

Skeleton alias of DeepseekV3_2Config until GLM-specific bring-up diverges from DeepSeek-V3.2.

Parameters:

dtype (DType)
kv_params (KVCacheParamInterface)
devices (list[DeviceRef])
use_subgraphs (bool)
data_parallel_degree (int)
vocab_size (int)
hidden_size (int)
intermediate_size (int)
moe_intermediate_size (int)
moe_layer_freq (int)
num_hidden_layers (int)
num_attention_heads (int)
num_key_value_heads (int)
n_shared_experts (int)
n_routed_experts (int)
routed_scaling_factor (float)
kv_lora_rank (int)
q_lora_rank (int)
qk_rope_head_dim (int)
v_head_dim (int)
qk_nope_head_dim (int)
topk_method (str)
n_group (int)
topk_group (int)
num_experts_per_tok (int)
first_k_dense_replace (int)
norm_topk_prob (bool)
hidden_act (str)
max_position_embeddings (int)
max_seq_len (int)
rms_norm_eps (float)
tie_word_embeddings (bool)
rope_theta (float)
rope_scaling (dict[str, Any] | None)
rope_interleave (bool)
scoring_func (str)
attention_bias (bool)
attention_dropout (float)
norm_dtype (DType)
gate_dtype (DType | None)
correction_bias_dtype (DType | None)
max_batch_context_length (int)
quant_config (QuantConfig | None)
dense_mlp_layers_without_quant (frozenset[int])
ep_config (EPConfig | None)
graph_mode (str)
return_logits (ReturnLogits)
return_hidden_states (ReturnHiddenStates)
eagle_aux_hidden_state_layer_ids (list[int] | None)
eplb_profile_enabled (bool)
index_head_dim (int)
index_n_heads (int)
index_topk (int)
indexer_types (list[str])
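
As a rough illustration of how `first_k_dense_replace` and `moe_layer_freq` interact, the sketch below assumes the layer-selection rule from the public DeepSeek-V3 reference code: a layer uses MoE when its index is at least `first_k_dense_replace` and is divisible by `moe_layer_freq`. The helper `is_moe_layer` is hypothetical and not part of this module, and the MAX implementation may apply a different rule.

```python
def is_moe_layer(
    layer_idx: int,
    first_k_dense_replace: int = 3,
    moe_layer_freq: int = 1,
) -> bool:
    """Assumed rule: dense MLP for the first k layers, MoE afterwards."""
    return (
        layer_idx >= first_k_dense_replace
        and layer_idx % moe_layer_freq == 0
    )


# Defaults taken from the Glm5_1Config signature above.
num_hidden_layers = 61
dense_layers = [i for i in range(num_hidden_layers) if not is_moe_layer(i)]
moe_layers = [i for i in range(num_hidden_layers) if is_moe_layer(i)]

print(dense_layers)     # indices of layers that keep a dense MLP
print(len(moe_layers))  # number of layers that route through experts
```

Under this assumed rule and the default values, only the first three layers keep a dense MLP and every remaining layer routes through the experts.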

`initialize()`

classmethod initialize(pipeline_config, model_config=None)

Initialize config, mapping GLM default RoPE to rope_scaling=None.

Parameters:

pipeline_config (PipelineConfig)
model_config (MAXModelConfig | None)

Return type:

Self
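
One way to picture the RoPE mapping is the sketch below. It assumes the checkpoint config carries a Hugging Face style `rope_scaling` dict whose type is stored under `rope_type` (or the legacy `type` key) and that the value `"default"` means unscaled RoPE. The function `normalize_rope_scaling` is a hypothetical stand-in; what `initialize()` actually does is not shown on this page.

```python
from typing import Any, Optional


def normalize_rope_scaling(
    rope_scaling: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: treat a 'default' RoPE entry as no scaling."""
    if rope_scaling is None:
        return None
    # Assumption: newer configs use "rope_type", older ones use "type".
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    if rope_type == "default":
        return None
    return rope_scaling


print(normalize_rope_scaling({"rope_type": "default"}))
print(normalize_rope_scaling({"rope_type": "yarn", "factor": 4.0}))
```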

`Glm5_1Model`

class max.pipelines.architectures.glm5_1.Glm5_1Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.ALL, return_hidden_states=ReturnHiddenStates.NONE, max_batch_size=1)

Bases: DeepseekV3_2Model

GLM-5.1 pipeline model.

Skeleton alias of DeepseekV3_2Model until GLM-specific bring-up diverges from DeepSeek-V3.2.

Parameters:

pipeline_config (PipelineConfig)
session (InferenceSession)
devices (list[Device])
kv_cache_config (KVCacheConfig)
weights (Weights)
adapter (WeightsAdapter | None)
return_logits (ReturnLogits)
return_hidden_states (ReturnHiddenStates)
max_batch_size (int)

`model_config_cls`

model_config_cls

alias of Glm5_1Config

`GlmReasoningParser`

class max.pipelines.architectures.glm5_1.GlmReasoningParser(think_start_token_id, think_end_token_id, tool_call_start_token_id=None)

Bases: ReasoningParser

GLM-4.5+ (GLM-5.1 / GLM-5.2) reasoning parser for <think> sections.

GLM’s chat template appends <think> to every assistant turn via add_generation_prompt when thinking is enabled (the default), so reasoning begins implicitly without an explicit <think> token in the model output stream. Reasoning ends explicitly at </think>, or implicitly when a tool call begins (<tool_call>). In the implicit case, the tool-call marker is left in the content region for the tool parser to consume.

Mirrors the Qwen 3.5 parser (same <think> prefill semantics); only the delimiter tokens differ.
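
The boundary rules above can be sketched on a complete (non-streaming) list of token IDs. This is a minimal illustration, not the parser's implementation: `split_reasoning` is a hypothetical function, and the token IDs in the example are made up.

```python
from collections.abc import Sequence
from typing import Optional


def split_reasoning(
    token_ids: Sequence[int],
    think_end_token_id: int,
    tool_call_start_token_id: Optional[int] = None,
) -> tuple[list[int], list[int]]:
    """Split generated tokens into (reasoning, content).

    Reasoning is assumed to start at the first generated token, because
    the chat template has already prefilled <think>.
    """
    for i, tok in enumerate(token_ids):
        if tok == think_end_token_id:
            # Explicit end: drop the </think> delimiter itself.
            return list(token_ids[:i]), list(token_ids[i + 1 :])
        if tok == tool_call_start_token_id and tool_call_start_token_id is not None:
            # Implicit end: keep <tool_call> in the content region.
            return list(token_ids[:i]), list(token_ids[i:])
    # No delimiter yet: everything so far is still reasoning.
    return list(token_ids), []


# Made-up IDs for illustration only.
THINK_END, TOOL_CALL = 2, 3
print(split_reasoning([10, 11, THINK_END, 20], THINK_END, TOOL_CALL))
print(split_reasoning([10, TOOL_CALL, 20], THINK_END, TOOL_CALL))
```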

Parameters:

think_start_token_id (int)
think_end_token_id (int)
tool_call_start_token_id (int | None)

`from_tokenizer()`

async classmethod from_tokenizer(tokenizer)

Construct a reasoning parser from a tokenizer.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

GlmReasoningParser

`reasoning_end_token_id()`

async classmethod reasoning_end_token_id(tokenizer)

Returns the </think> token id that closes a reasoning span.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

int | None

`stream()`

stream(delta_token_ids, is_currently_reasoning=True)

Identify a reasoning span within a streaming delta chunk.

Parameters:

delta_token_ids (Sequence[int])
is_currently_reasoning (bool)

Return type:

ParsedReasoningDelta

`will_reason_after_prompt()`

will_reason_after_prompt(prompt_token_ids)

Decide whether the next generated token continues a reasoning span.

The chat template embeds a literal <tool_call> example in the tool instructions, and <tool_call> is a reasoning-end delimiter, so a left-to-right scan would falsely conclude reasoning already ended. Multi-turn prompts also carry <think>/</think> from prior turns. Scan right-to-left: the last delimiter before generation is the chat template’s prefilled <think>.

Parameters:

prompt_token_ids (Sequence[int])

Return type:

bool
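
The right-to-left scan described above can be sketched as follows. The function name `last_delimiter_opens_reasoning`, the token IDs, and the choice to return `False` when no delimiter is found are all assumptions made for illustration; the actual method reads the IDs from the parser instance.

```python
from collections.abc import Sequence
from typing import Optional


def last_delimiter_opens_reasoning(
    prompt_token_ids: Sequence[int],
    think_start_token_id: int,
    think_end_token_id: int,
    tool_call_start_token_id: Optional[int] = None,
) -> bool:
    """Return True if the last delimiter in the prompt is <think>."""
    for tok in reversed(prompt_token_ids):
        if tok == think_start_token_id:
            return True
        if tok == think_end_token_id:
            return False
        if tool_call_start_token_id is not None and tok == tool_call_start_token_id:
            return False
    # Assumption: a prompt with no delimiters does not start in reasoning.
    return False


# Made-up IDs: 1 = <think>, 2 = </think>, 3 = <tool_call>.
# The prompt holds a <tool_call> example, a prior turn, then a prefilled <think>.
prompt = [3, 1, 50, 2, 60, 1]
print(last_delimiter_opens_reasoning(prompt, 1, 2, 3))
```

A left-to-right scan of the same prompt would stop at the `<tool_call>` example in the tool instructions and report that reasoning had already ended.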

`GlmTokenizer`

class max.pipelines.architectures.glm5_1.GlmTokenizer(model_path, pipeline_config, *, revision=None, max_length=None, trust_remote_code=False, enable_llama_whitespace_fix=False, chat_template=None, **unused_kwargs)

Bases: TextTokenizer

Text tokenizer for GLM-4.5+ (GLM-5.1 / GLM-5.2).

Identical to TextTokenizer but also implements ReasoningPipelineTokenizer by resolving the <think>/</think> delimiter token IDs at construction. The overlap (speculative/MTP) text-generation pipeline requires these ids on the tokenizer when a reasoning_parser is configured.

Parameters:

model_path (str)
pipeline_config (PipelineConfig)
revision (str | None)
max_length (int | None)
trust_remote_code (bool)
enable_llama_whitespace_fix (bool)
chat_template (str | None)
unused_kwargs (Any)

`reasoning_end_token_id`

property reasoning_end_token_id: int

Token id of </think> (closes a GLM reasoning span).

`reasoning_start_token_id`

property reasoning_start_token_id: int

Token id of <think> (opens a GLM reasoning span).

`GlmToolParser`

class max.pipelines.architectures.glm5_1.GlmToolParser

Bases: StructuralTagToolParser

Parses GLM-4.5+ (GLM-5.1 / GLM-5.2) tool calls.

Flat layout: only CALL_BEGIN/CALL_END are set, so the base class scans for <tool_call> … </tool_call> pairs directly. Within each call the function name precedes the first <arg_key>; the remainder is parameter XML that we convert to growing JSON for streaming.
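
To make the flat layout concrete, here is a minimal non-streaming sketch that extracts calls from completed text. It is not the parser's implementation: `parse_tool_calls` is a hypothetical function, it works on a finished string rather than growing JSON, and it guesses value types by trying `json.loads`, since no tool schema is available here to distinguish the string "3" from the number 3.

```python
import json
import re

CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract {"name", "arguments"} dicts from GLM-style tool-call text."""
    calls = []
    for body in CALL_RE.findall(text):
        # The function name is everything before the first <arg_key>.
        name, sep, rest = body.partition("<arg_key>")
        arguments = {}
        for key, raw in ARG_RE.findall(sep + rest):
            raw = raw.strip()
            try:
                # Non-string values are assumed to be JSON-encoded.
                arguments[key.strip()] = json.loads(raw)
            except json.JSONDecodeError:
                # Bare strings are kept as-is.
                arguments[key.strip()] = raw
        calls.append({"name": name.strip(), "arguments": arguments})
    return calls


text = (
    "<tool_call>get_weather\n"
    "<arg_key>city</arg_key>\n<arg_value>Paris</arg_value>\n"
    "<arg_key>days</arg_key>\n<arg_value>3</arg_value>\n"
    "</tool_call>"
)
print(parse_tool_calls(text))
```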

`CALL_BEGIN`

CALL_BEGIN: ClassVar[str] = '<tool_call>'

`CALL_END`

CALL_END: ClassVar[str] = '</tool_call>'

`generate_tool_call_grammar()`

static generate_tool_call_grammar(response_format_schema=None, tools=None, tokenizer=None, **kwargs)

Generates a Lark grammar for GLM tool-call constrained decoding.

Special tokens are referenced by ID (<[N]>) so multi-byte literal matches don’t trip llguidance. The envelope and <arg_key>/<arg_value> framing are always constrained, and the calls must end on a turn-ender token (<|observation|>/<|user|>/<|endoftext|>) so the grammar closes instead of looping.

When a tool supplies a parameters schema, arguments are constrained to it: <arg_key> is restricted to the declared property names, required properties must appear, and each <arg_value> is constrained to its property type — bare for strings (with enum / pattern support), and via %json over the sub-schema for every other type (numbers, booleans, nested objects/arrays, etc.). Tools with no properties schema fall back to permissive (valid-structure) args. When response_format_schema is provided an alternative JSON branch matching the schema is added.

String values do not enforce maxLength or format: GLM strings are emitted bare, so JSON-schema string facets beyond pattern aren’t applied. Numeric and object facets are enforced only as far as %json covers them.

Parameters:

response_format_schema (dict[str, Any] | None) – Optional JSON schema dict. When provided, the grammar also accepts a JSON response matching the schema.
tools (list[dict[str, Any]] | None) – Optional list of OpenAI-style tool dicts. None accepts any tool name.
tokenizer (PipelineTokenizer[Any, Any, Any] | None) – Tokenizer used to resolve GLM special-token IDs.
**kwargs (Any) – Ignored (accepts backend, tool_choice, etc.).

Returns:

A Lark grammar string for the constrained-decoding backend.

Return type:

str
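
For orientation, the sketch below shows an OpenAI-style tool dict of the kind passed in `tools`, and restates three of the argument constraints described above (declared property names only, required properties present, enum membership) as ordinary post-hoc checks. The tool definition and the helper `argument_errors` are hypothetical. The real method emits a Lark grammar that enforces these rules during decoding, which this sketch does not attempt to reproduce.

```python
from typing import Any

# Hypothetical tool definition in the OpenAI function-calling layout.
weather_tool: dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                "days": {"type": "integer"},
            },
            "required": ["city"],
        },
    },
}


def argument_errors(tool: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """List violations of the name, required, and enum constraints."""
    schema = tool["function"].get("parameters", {})
    properties = schema.get("properties", {})
    if not properties:
        # No properties schema: arguments are treated permissively.
        return []
    errors = [f"unknown argument: {k}" for k in arguments if k not in properties]
    errors += [
        f"missing required argument: {k}"
        for k in schema.get("required", [])
        if k not in arguments
    ]
    for key, value in arguments.items():
        enum = properties.get(key, {}).get("enum")
        if enum is not None and value not in enum:
            errors.append(f"{key} not in enum")
    return errors


print(argument_errors(weather_tool, {"city": "Paris", "unit": "celsius"}))
print(argument_errors(weather_tool, {"unit": "kelvin"}))
```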
