IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.
Skip to main content
For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Mojo function

mla_decode_max_seq_len

def mla_decode_max_seq_len[dtype: DType, num_heads: Int]() -> Int

Max query tokens (S) the MLA decode branch can fold for this config.

On AMD the S>1 fold is FP8-only AND num_heads<=AMD_MLA_DECODE_FOLD_MAX_NUM_HEADS only (the larger-head dispatch arm is S=1-only), so a BF16 cache or num_heads>16 routes S>1 to MLA prefill; NVIDIA folds up to MLA_DECODE_MAX_SEQ_LEN for both. Mirrors the host-side fold gate in flare_mla_decoding_dispatch so the router never hands the gate an S>1 batch it would reject.

Returns:

Int