For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).
Mojo function
mla_decode_max_seq_len
def mla_decode_max_seq_len[dtype: DType, num_heads: Int]() -> Int
Max query tokens (S) the MLA decode branch can fold for this config.
On AMD the S>1 fold is FP8-only AND num_heads<=AMD_MLA_DECODE_FOLD_MAX_NUM_HEADS
only (the larger-head dispatch arm is S=1-only), so a BF16 cache or num_heads>16
routes S>1 to MLA prefill; NVIDIA folds up to MLA_DECODE_MAX_SEQ_LEN for both.
Mirrors the host-side fold gate in flare_mla_decoding_dispatch so the router
never hands the gate an S>1 batch it would reject.
Returns:
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!