Mojo module
mla
comptime values
AMD_MLA_DECODE_FOLD_M_MAX
comptime AMD_MLA_DECODE_FOLD_M_MAX = 128
AMD_MLA_DECODE_FOLD_MAX_NUM_HEADS
comptime AMD_MLA_DECODE_FOLD_MAX_NUM_HEADS = 16
MLA_DECODE_MAX_SEQ_LEN
comptime MLA_DECODE_MAX_SEQ_LEN = 8
Functions
- copy_fn_unified
- flare_mla_decoding: MLA decoding kernel that would only be called in the optimized compute graph.
- flare_mla_decoding_dispatch
- flare_mla_prefill: MLA prefill kernel that would only be called in the optimized compute graph. Only supports ragged Q/K/V inputs.
- flare_mla_prefill_dispatch
- mla_decode_max_seq_len: Max query tokens (S) the MLA decode branch can fold for this config.
- mla_decoding
- mla_decoding_single_batch: Flash attention v2 algorithm.
- mla_prefill
- mla_prefill_plan: This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer.
- mla_prefill_plan_kernel
- mla_prefill_single_batch: MLA for encoding where seqlen > 1.
- mla_splitk_reduce
- q_block_idx
- set_buffer_lengths_to_zero
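
The planning idea described for mla_prefill_plan (processing sequences of varying lengths through a fixed-size buffer) can be pictured with a small host-side sketch. It is written in Python rather than Mojo and is only an illustration under stated assumptions: plan_chunks, Slice, and buffer_tokens are hypothetical names, and the real function launches a GPU kernel whose signature and partitioning strategy are not documented on this page.

```python
from typing import List, Tuple

# One unit of planned work: (sequence index, start token, end token),
# with the end index exclusive. Hypothetical layout, not the kernel's.
Slice = Tuple[int, int, int]


def plan_chunks(seq_lens: List[int], buffer_tokens: int) -> List[List[Slice]]:
    """Split variable-length sequences into passes of at most buffer_tokens tokens.

    Illustrative only: this is not the algorithm or signature of mla_prefill_plan.
    """
    if buffer_tokens <= 0:
        raise ValueError("buffer_tokens must be positive")

    chunks: List[List[Slice]] = []
    current: List[Slice] = []
    free = buffer_tokens  # tokens still available in the current pass

    for seq_idx, length in enumerate(seq_lens):
        start = 0
        while start < length:
            # Take as much of this sequence as still fits in the buffer.
            take = min(free, length - start)
            current.append((seq_idx, start, start + take))
            start += take
            free -= take
            if free == 0:
                # Buffer is full: close this pass and open a new one.
                chunks.append(current)
                current = []
                free = buffer_tokens

    if current:
        chunks.append(current)
    return chunks


plan = plan_chunks([5, 3, 9], buffer_tokens=8)
for pass_idx, chunk in enumerate(plan):
    print(pass_idx, chunk)
```

In this sketch a long sequence may span several passes and several short sequences may share one pass, so every pass fits the buffer regardless of how uneven the batch is.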