For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Mojo function

mla_decode_max_seq_len

def mla_decode_max_seq_len[dtype: DType, num_heads: Int]() -> Int

Max query tokens (S) the MLA decode branch can fold for this config.

On AMD the S>1 fold is FP8-only AND num_heads<=AMD_MLA_DECODE_FOLD_MAX_NUM_HEADS only (the larger-head dispatch arm is S=1-only), so a BF16 cache or num_heads>16 routes S>1 to MLA prefill; NVIDIA folds up to MLA_DECODE_MAX_SEQ_LEN for both. Mirrors the host-side fold gate in flare_mla_decoding_dispatch so the router never hands the gate an S>1 batch it would reject.

Returns:

Int