
Python module

max.pipelines.architectures.qwen3_5

Qwen3_5Config​

class max.pipelines.architectures.qwen3_5.Qwen3_5Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, target_layer_ids=None, use_subgraphs=True, data_parallel_degree=1, sliding_window=None, layer_types=<factory>, full_attention_interval=4, linear_key_head_dim=128, linear_value_head_dim=128, linear_num_key_heads=16, linear_num_value_heads=48, linear_conv_kernel_dim=4, partial_rotary_factor=0.25, attn_output_gate=True, mamba_ssm_dtype=float32, vision_config=None, image_token_id=None, video_token_id=None, vision_start_token_id=None, mrope_section=None)

source

Bases: Llama3Config

Configuration for Qwen3.5 hybrid attention models.

Qwen3.5 uses a hybrid architecture with both full (standard) attention and linear attention (Gated DeltaNet) layers. Every full_attention_interval-th layer uses full attention, and the rest use linear attention.

Parameters:

attn_output_gate​

attn_output_gate: bool = True

source

Whether full attention layers use a sigmoid output gate.

calculate_attention_multiplier()​

static calculate_attention_multiplier(huggingface_config)

source

Compute attention scaling factor using explicit head_dim.

Parameters:

huggingface_config (AutoConfig)

Return type:

float
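
As a rough illustration of what "explicit head_dim" can mean, the sketch below assumes the multiplier is the standard 1/sqrt(head_dim) softmax scale and that the Hugging Face config exposes a `head_dim` attribute. The function name and the fallback branch are hypothetical and are not taken from the MAX source.

```python
import math
from types import SimpleNamespace


def sketch_attention_multiplier(huggingface_config) -> float:
    """Hypothetical stand-in: standard 1/sqrt(head_dim) attention scale."""
    # Prefer the explicit head_dim; derive it only when the field is absent.
    head_dim = getattr(huggingface_config, "head_dim", None)
    if head_dim is None:
        head_dim = (
            huggingface_config.hidden_size
            // huggingface_config.num_attention_heads
        )
    return 1.0 / math.sqrt(head_dim)


# Illustrative values only; not read from a real checkpoint.
cfg = SimpleNamespace(hidden_size=5120, num_attention_heads=24, head_dim=256)
multiplier = sketch_attention_multiplier(cfg)
```

With these made-up values, deriving head_dim as hidden_size // num_attention_heads would give 213 rather than 256, which is the kind of mismatch an explicit head_dim avoids.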

construct_kv_params()​

static construct_kv_params(huggingface_config, pipeline_config, devices, kv_cache_config, cache_dtype)

source

Construct KV cache parameters for full attention layers only.

Only allocates KV cache entries for full-attention layers; linear attention layers use separate conv/recurrent state buffers instead. The forward pass maps each full-attention layer to a sequential KV cache index (0, 1, 2, …) independent of the absolute layer index.

Parameters:

Return type:

KVCacheParams
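
The index mapping described above can be sketched as follows. This is a minimal illustration, assuming the per-layer types are available as a list of strings as in `layer_types`; the helper name is hypothetical.

```python
def sketch_kv_cache_indices(layer_types: list[str]) -> dict[int, int]:
    """Map absolute layer index -> sequential KV cache index.

    Only full-attention layers receive an entry; linear-attention layers
    are skipped because they keep conv/recurrent state elsewhere.
    """
    mapping: dict[int, int] = {}
    for layer_idx, layer_type in enumerate(layer_types):
        if layer_type == "full_attention":
            mapping[layer_idx] = len(mapping)
    return mapping


# Eight layers with every fourth layer using full attention.
layer_types = (["linear_attention"] * 3 + ["full_attention"]) * 2
kv_index = sketch_kv_cache_indices(layer_types)
```

Here layers 3 and 7 map to KV cache indices 0 and 1, so the cache only needs two entries for eight layers.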

full_attention_interval​

full_attention_interval: int = 4

source

Every N-th layer uses full attention.
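
A minimal sketch of how an interval could expand into a per-layer pattern. It assumes 1-based counting, so that with the default interval of 4 the fourth, eighth, and so on layers use full attention; the helper name is hypothetical, and a checkpoint that ships an explicit `layer_types` list would take precedence.

```python
def sketch_layer_types(
    num_hidden_layers: int, full_attention_interval: int = 4
) -> list[str]:
    """Hypothetical helper: expand the interval into per-layer types."""
    return [
        "full_attention"
        if (i + 1) % full_attention_interval == 0  # assumed 1-based counting
        else "linear_attention"
        for i in range(num_hidden_layers)
    ]


pattern = sketch_layer_types(8)
```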

get_num_layers()​

static get_num_layers(huggingface_config)

source

Layer count for the decoder stack (override when HF uses a different field).

Parameters:

huggingface_config (AutoConfig)

Return type:

int

image_token_id​

image_token_id: int | None = None

source

Token ID used for image placeholders in the input sequence.

infer_optimal_batch_size()​

infer_optimal_batch_size(devices, *, weights_size, device_memory_utilization)

source

Return a memory-safe default max_batch_size for this architecture.

Qwen3.5 stores GatedDeltaNet conv and recurrent state in a single max_batch x per_req pool that the slot-indexed SSM kernels mutate in place. There are no working copies, so peak footprint is max_batch x per_req bytes.

We split the post-weights utilization budget evenly: the state pool gets up to half, the KV cache absorbs the rest. This uses the same device_memory_utilization headroom factor as the rest of the pipeline, and matches the estimate_activation_memory() reservation.

Falls back to 32 (safe for the 27B model on H100/A100 with 80 GB) when the device query fails.

Parameters:

Return type:

int
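
The budget split described above can be approximated with simple arithmetic. The sketch below is not the MAX implementation: the function name, the `device_bytes` and `per_req_state_bytes` arguments, and the clamp to a minimum of 1 are assumptions made for illustration.

```python
from typing import Optional

FALLBACK_MAX_BATCH = 32  # fallback value quoted in the description above


def sketch_max_batch_size(
    device_bytes: Optional[int],
    weights_size: int,
    device_memory_utilization: float,
    per_req_state_bytes: int,
) -> int:
    """Hypothetical helper: half of the post-weights budget goes to state."""
    if device_bytes is None:
        # Stand-in for "the device query failed".
        return FALLBACK_MAX_BATCH
    budget = int(device_bytes * device_memory_utilization) - weights_size
    state_pool_bytes = max(budget, 0) // 2  # the KV cache absorbs the rest
    # No working copies, so peak footprint is max_batch * per_req bytes.
    return max(1, state_pool_bytes // per_req_state_bytes)  # clamp is assumed


# Illustrative numbers only (not measured): 80 GiB device, 54 GiB of weights.
GIB = 1024**3
example = sketch_max_batch_size(80 * GIB, 54 * GIB, 0.9, 100 * 1024**2)
```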

initialize()​

classmethod initialize(pipeline_config, model_config=None)

source

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self

initialize_from_config()​

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

source

Initialize config from pipeline and HuggingFace configurations.

Handles both multimodal (Qwen3_5ForConditionalGeneration) and text-only (Qwen3_5ForCausalLM) configs by extracting the text config.

Parameters:

Return type:

Self
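
One common way to handle both config shapes is to look for a nested text config and fall back to the top-level object. The sketch below assumes the multimodal Hugging Face config exposes it as a `text_config` attribute, which is the usual convention but is not confirmed here; the helper name is hypothetical.

```python
from types import SimpleNamespace


def sketch_text_config(huggingface_config):
    """Return the nested text config if present, else the config itself."""
    return getattr(huggingface_config, "text_config", huggingface_config)


# Stand-ins for the two config shapes; the field values are made up.
text_only = SimpleNamespace(num_hidden_layers=48)
multimodal = SimpleNamespace(
    text_config=SimpleNamespace(num_hidden_layers=48),
    vision_config=SimpleNamespace(depth=24),
)
resolved = [sketch_text_config(text_only), sketch_text_config(multimodal)]
```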

layer_types​

layer_types: list[str]

source

Per-layer attention type: 'full_attention' or 'linear_attention'.

linear_conv_kernel_dim​

linear_conv_kernel_dim: int = 4

source

Causal conv1d kernel size for linear attention layers.

linear_key_head_dim​

linear_key_head_dim: int = 128

source

Key head dimension for linear attention layers.

linear_num_key_heads​

linear_num_key_heads: int = 16

source

Number of key heads for linear attention layers.

linear_num_value_heads​

linear_num_value_heads: int = 48

source

Number of value heads for linear attention layers.

linear_value_head_dim​

linear_value_head_dim: int = 128

source

Value head dimension for linear attention layers.

mamba_ssm_dtype​

mamba_ssm_dtype: DType = float32

source

Dtype for SSM (state space model) computations in linear attention layers.

mrope_section​

mrope_section: list[int] | None = None

source

MRoPE section lengths for multimodal rotary position encoding.

partial_rotary_factor​

partial_rotary_factor: float = 0.25

source

Fraction of head_dim that gets rotary position embedding.
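
To make the fraction concrete, the sketch below computes how many dimensions of a head would be rotated. The head_dim of 256 is an illustrative value, and rotating the leading dimensions while passing the rest through unchanged is an assumption borrowed from common partial-rotary implementations.

```python
def sketch_rotary_dim(head_dim: int, partial_rotary_factor: float = 0.25) -> int:
    """Number of head dimensions that receive rotary position embedding."""
    return int(head_dim * partial_rotary_factor)


def sketch_split_head(vec: list[float], rotary_dim: int):
    """Split one head vector into (rotated part, pass-through part)."""
    # Assumption: the leading rotary_dim entries are the rotated ones.
    return vec[:rotary_dim], vec[rotary_dim:]


rotary_dim = sketch_rotary_dim(256)
rot, passthrough = sketch_split_head([0.0] * 256, rotary_dim)
```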

video_token_id​

video_token_id: int | None = None

source

Token ID used for video placeholders in the input sequence.

vision_config​

vision_config: VisionConfig | None = None

source

Vision encoder configuration; None for text-only models.

vision_start_token_id​

vision_start_token_id: int | None = None

source

Token ID that marks the start of vision content.

Qwen3_5Inputs​

class max.pipelines.architectures.qwen3_5.Qwen3_5Inputs(tokens, input_row_offsets, signal_buffers, return_n_logits, data_parallel_splits=None, slot_idx=None, conv_pools=None, recurrent_pools=None, request_ids=None, image_token_indices=None, pixel_values=None, vision_position_ids=None, weights=None, indices=None, max_grid_size=None, grid_thw=None, cu_seqlens=None, max_seqlen=None, lm_image_embeddings=None, *, kv_cache_inputs=None, lora=None, hidden_states=None)

source

Bases: Llama3Inputs

Inputs for Qwen3.5 including linear attention states and optional vision inputs.

Parameters:

buffers​

property buffers: tuple[Buffer, ...]

source

Returns positional Buffer inputs for model ABI calls.

conv_pools​

conv_pools: list[Buffer] | None = None

source

Per-layer mutable conv pool, [max_slots, conv_dim, K-1].

cu_seqlens​

cu_seqlens: Buffer | None = None

source

Cumulative sequence lengths for vision full attention.

grid_thw​

grid_thw: Buffer | None = None

source

Grid dimensions (temporal, height, width) per image, shape (n_images, 3).
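
As a hedged illustration of how cu_seqlens and max_seqlen could relate to grid_thw, the sketch below treats each temporal frame as one attention segment of height × width patches. That per-frame segmentation is an assumption borrowed from Qwen2-VL-style vision encoders and is not confirmed for this module; the helper name is hypothetical.

```python
from itertools import accumulate


def sketch_cu_seqlens(grid_thw: list[tuple[int, int, int]]):
    """Return (cu_seqlens, max_seqlen) for per-frame attention segments."""
    per_frame: list[int] = []
    for t, h, w in grid_thw:
        per_frame.extend([h * w] * t)  # one segment per temporal frame
    cu_seqlens = [0, *accumulate(per_frame)]
    max_seqlen = max(per_frame, default=0)
    return cu_seqlens, max_seqlen


# Two inputs: one 4x6 image and one 2-frame clip of 2x2 patches.
cu_seqlens, max_seqlen = sketch_cu_seqlens([(1, 4, 6), (2, 2, 2)])
```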

has_vision_inputs​

property has_vision_inputs: bool

source

True when pixel values are available for vision encoding.

image_token_indices​

image_token_indices: Buffer | None = None

source

Pre-computed scatter indices for image embeddings.
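
A minimal sketch of what scatter indices could look like, assuming they are the positions of the image placeholder token in the flattened token sequence. Plain Python lists stand in for Buffer objects, and both helper names are hypothetical.

```python
def sketch_image_token_indices(tokens: list[int], image_token_id: int) -> list[int]:
    """Positions in the token sequence that hold an image placeholder."""
    return [i for i, tok in enumerate(tokens) if tok == image_token_id]


def sketch_scatter(text_embeds: list, image_embeds: list, indices: list[int]) -> list:
    """Copy the text embeddings, overwriting placeholder rows with image rows."""
    out = list(text_embeds)
    for row, pos in zip(image_embeds, indices):
        out[pos] = row
    return out


IMAGE = 99  # made-up placeholder id for illustration
tokens = [5, IMAGE, IMAGE, 7]
indices = sketch_image_token_indices(tokens, IMAGE)
merged = sketch_scatter(["t0", "t1", "t2", "t3"], ["img0", "img1"], indices)
```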

indices​

indices: Buffer | None = None

source

Bilinear interpolation indices for vision position embeddings.

lm_image_embeddings​

lm_image_embeddings: Buffer | None = None

source

Image embeddings for the LM graph: an empty [0, H] buffer for decode and text-only steps, or the real embeddings for prefill steps with images. Must be non-None for multimodal models.

max_grid_size​

max_grid_size: Buffer | None = None

source

Maximum grid size (CPU scalar) for vision attention.

max_seqlen​

max_seqlen: Buffer | None = None

source

Maximum sequence length (CPU scalar) for vision attention.

pixel_values​

pixel_values: Buffer | None = None

source

Raw pixel values for vision encoding.

recurrent_pools​

recurrent_pools: list[Buffer] | None = None

source

Per-layer mutable recurrent pool, [max_slots, nv, KD, VD].

request_ids​

request_ids: list[RequestID] | None = None

source

Request IDs for this batch, used to manage per-request state cache slots.

slot_idx​

slot_idx: Buffer | None = None

source

Per-batch [B] uint32 slot indices into the linear-attention pools.
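
The relationship between slot_idx and the pools can be sketched with plain Python lists standing in for device buffers. Each batch position b reads and writes row slot_idx[b] of the pool in place; the helper name and the flattened per-slot state are simplifications for illustration.

```python
def sketch_write_states(pool: list, slot_idx: list[int], new_states: list) -> None:
    """Overwrite pool rows in place; batch position b maps to slot_idx[b]."""
    for batch_pos, slot in enumerate(slot_idx):
        pool[slot][:] = new_states[batch_pos]


max_slots, state_size = 4, 3
pool = [[0.0] * state_size for _ in range(max_slots)]

# A batch of two requests that own slots 2 and 0.
slot_idx = [2, 0]
sketch_write_states(pool, slot_idx, [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
```

Slots 1 and 3 stay untouched, which is why requests outside the current batch keep their state between steps.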

vision_position_ids​

vision_position_ids: Buffer | None = None

source

Rotary position IDs for the vision encoder.

weights​

weights: Buffer | None = None

source

Bilinear interpolation weights for vision position embeddings.

Qwen3_5Model​

class max.pipelines.architectures.qwen3_5.Qwen3_5Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

source

Bases: AlwaysSignalBuffersMixin, LlamaModelBase

Qwen3.5 pipeline model implementation.

Supports the hybrid linear/full attention architecture with KV cache for full attention layers and conv/recurrent states for linear layers.

Parameters:

attention_bias​

attention_bias: bool = False

source

Whether to use attention bias.

calculate_max_seq_len()​

classmethod calculate_max_seq_len(pipeline_config, huggingface_config)

source

Calculates the optimal max sequence length for the model.

Default implementation delegates to model_config_cls. Override when pipeline-model semantics differ from the config (for example, bounding max_length where the config is permissive).

Parameters:

  • pipeline_config (PipelineConfig) – Configuration for the pipeline.
  • huggingface_config (AutoConfig) – Hugging Face model configuration.

Returns:

The maximum sequence length to use.

Return type:

int

execute()​

execute(model_inputs)

source

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.

load_model()​

load_model(session)

source

Parameters:

session (InferenceSession)

Return type:

Model

model​

model: Model

source

Compiled and initialized model ready for inference.

model_config_cls​

model_config_cls

source

alias of Qwen3_5Config

norm_method​

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

source

Normalization layer.

prepare_initial_token_inputs()​

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

source

Prepare the inputs for the first pass in multistep execution.

Parameters:

Return type:

Qwen3_5Inputs

release()​

release(request_id)

source

Release per-request state cache slot when a request completes.

Parameters:

request_id (RequestID)

Return type:

None
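
For intuition, per-request slot bookkeeping can be sketched as a small free-list allocator. The class and method names below are hypothetical and do not describe how Qwen3_5Model tracks slots internally.

```python
class SketchSlotAllocator:
    """Hypothetical request-id -> slot bookkeeping with a free list."""

    def __init__(self, max_slots: int) -> None:
        # Reverse order so that pop() hands out slot 0 first.
        self._free = list(range(max_slots - 1, -1, -1))
        self._slots: dict[str, int] = {}

    def claim(self, request_id: str) -> int:
        """Return the slot for a request, assigning one on first use."""
        if request_id not in self._slots:
            if not self._free:
                raise RuntimeError("no free state slots")
            self._slots[request_id] = self._free.pop()
        return self._slots[request_id]

    def release(self, request_id: str) -> None:
        """Return a completed request's slot to the free list."""
        slot = self._slots.pop(request_id, None)
        if slot is not None:
            self._free.append(slot)


allocator = SketchSlotAllocator(max_slots=2)
first = allocator.claim("req-a")
second = allocator.claim("req-b")
allocator.release("req-a")
reused = allocator.claim("req-c")
```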

state_dict​

state_dict: dict[str, Any]

source

Weights to load into the model.

vision_model​

vision_model: Model | None = None

source

Qwen3_5ReasoningParser​

class max.pipelines.architectures.qwen3_5.Qwen3_5ReasoningParser(think_start_token_id, think_end_token_id, tool_call_start_token_id=None)

source

Bases: ReasoningParser

Qwen 3.5 / 3.6 reasoning parser for <think>...</think> sections.

Qwen 3.5/3.6’s chat template prepends <think>\n to every assistant turn when enable_thinking is true (the default), so reasoning begins implicitly without an explicit <think> token in the model output stream. Reasoning ends explicitly at </think>, or implicitly when a tool call begins (<tool_call>); the tool-call marker is left in the content region for the tool parser to consume.

Parameters:

  • think_start_token_id (int)
  • think_end_token_id (int)
  • tool_call_start_token_id (int | None)

from_tokenizer()​

async classmethod from_tokenizer(tokenizer)

source

Construct a reasoning parser from a tokenizer.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

Qwen3_5ReasoningParser

reasoning_end_token_id()​

async classmethod reasoning_end_token_id(tokenizer)

source

Returns the </think> token id that closes a reasoning span.

Parameters:

tokenizer (PipelineTokenizer[Any, Any, Any])

Return type:

int | None

stream()​

stream(delta_token_ids, is_currently_reasoning=True)

source

Identify a reasoning span within a streaming delta chunk.

When is_currently_reasoning=False and the chunk contains no <think> opener, returns an empty span so post-reasoning content chunks aren’t misclassified as reasoning.

Parameters:

Return type:

ParsedReasoningDelta
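
The span logic described for this parser can be sketched on raw token ids. The function below is illustrative only: it returns a plain tuple rather than a ParsedReasoningDelta, and its name, keyword arguments, and return shape are assumptions.

```python
from typing import Optional, Sequence


def sketch_stream(
    delta_token_ids: Sequence[int],
    is_currently_reasoning: bool = True,
    *,
    think_start_id: int,
    think_end_id: int,
    tool_call_start_id: Optional[int] = None,
):
    """Return (reasoning_ids, content_ids, still_reasoning) for one chunk."""
    ids = list(delta_token_ids)
    start, prefix = 0, []
    if not is_currently_reasoning:
        if think_start_id not in ids:
            # No opener: the whole chunk is content, the span is empty.
            return [], ids, False
        opener = ids.index(think_start_id)
        prefix, start = ids[:opener], opener + 1
    for pos in range(start, len(ids)):
        if ids[pos] == think_end_id:
            # Explicit end: drop the closing delimiter itself.
            return ids[start:pos], prefix + ids[pos + 1:], False
        if tool_call_start_id is not None and ids[pos] == tool_call_start_id:
            # Implicit end: leave the tool-call marker in the content.
            return ids[start:pos], prefix + ids[pos:], False
    return ids[start:], prefix, True


THINK, END, TOOL = 1, 2, 3  # made-up token ids
explicit_end = sketch_stream([10, 11, END, 12], think_start_id=THINK, think_end_id=END)
```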

will_reason_after_prompt()​

will_reason_after_prompt(prompt_token_ids)

source

Decide whether the next generated token continues a reasoning span.

Overrides the ABC default (which delegates to stream scanning left-to-right). That default is wrong for Qwen: the chat template embeds a literal <tool_call> example in the tool instructions, and <tool_call> is a reasoning-end delimiter, so a left-to-right scan hits the example and falsely concludes reasoning already ended, leaking the model’s <think> block into content.

Multi-turn prompts can also contain <think>/</think> tokens from prior assistant turns; only the most-recently-emitted delimiter describes the current state. Scan right-to-left: the last delimiter before generation is the chat template’s prefilled <think>.

Parameters:

prompt_token_ids (Sequence[int])

Return type:

bool
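
The right-to-left scan can be sketched as below. This is a simplified illustration on made-up token ids: it considers only the <think> and </think> delimiters, and returning False when neither appears is an assumption.

```python
from typing import Sequence


def sketch_will_reason_after_prompt(
    prompt_token_ids: Sequence[int], *, think_start_id: int, think_end_id: int
) -> bool:
    """The most recent think delimiter in the prompt decides the state."""
    for tok in reversed(prompt_token_ids):
        if tok == think_start_id:
            return True  # prompt ends inside an open reasoning span
        if tok == think_end_id:
            return False  # the last span was already closed
    return False  # assumed default when no delimiter is present


THINK, END, TOOL = 1, 2, 3  # made-up token ids
prompt = [
    TOOL, 40,              # literal <tool_call> example in the tool instructions
    THINK, 41, END, 42,    # a prior assistant turn
    THINK,                 # the chat template's prefilled <think>
]
will_reason = sketch_will_reason_after_prompt(
    prompt, think_start_id=THINK, think_end_id=END
)
```

Because the scan starts from the end of the prompt, neither the <tool_call> example nor the closed span from the earlier turn affects the result.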

Qwen3_5ToolParser​

class max.pipelines.architectures.qwen3_5.Qwen3_5ToolParser

source

Bases: object

Parser for Qwen 3.5 / 3.6 tool calls.

parse_complete()​

parse_complete(response)

source

Parse a complete model response into tool calls.

Parameters:

response (str)

Return type:

ParsedToolResponse

parse_delta()​

parse_delta(delta)

source

Incrementally process one decoded-token delta.

Returns content text to forward to the client and any tool-call increments to emit, in the order they were produced. Content deltas have content set; tool-call deltas have one or more of id / name / arguments set.

Parameters:

delta (str)

Return type:

list[ParsedToolCallDelta] | None

reset()​

reset()

source

Reset internal state for a new streaming session.

Return type:

None