IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo module

mla_components

MLA-prefill math components for AMD MI355X (gfx950).

This module is the MlaPrefillV2-owned home of the MLA-prefill numeric closure consumed by mla_prefill_v2.mojo. It contains the MlaPrefillV2Core[config] struct (trimmed to the methods and constants that _attend_exact and the host launcher actually reference), the _MlaKDmaPair single-base K DMA helper, and the module-level scheduling primitives (_sched_barrier_zero, _s_barrier_raw, _s_setprio).

The MLA-prefill math lives here: QK with a nope component (d=128) plus a rope component (d=64); FlashAttention-2 online softmax via OnlineSoftmax; in-place collapse of P to FP8; PV accumulation; normalization and store; and causal or null masking. Keeping it here lets the inner loop in mla_prefill_v2.mojo reuse these verified primitives while emitting its own cluster cadence. The shared building blocks (MhaMmaOp and MlaConfigV2 from mha_mma_op.mojo, the OnlineSoftmax recurrence, the MaskApplier mask dispatch, and the SubTileLoaderLDS_* loaders from amd_tile_io.mojo) are imported in place; this module owns only the MLA-prefill-specific assembly.
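The score split and the online-softmax recurrence described above can be sketched in plain Python for a single query row. This is a reference model under stated assumptions, not the Mojo implementation: it computes in Python floats throughout (no FP8 collapse of P), assumes a softmax scale of 1/sqrt(128 + 64), and the names `qk_score` and `attend_row` are hypothetical. The key-block loop stands in for iterating over KV tiles.

```python
import math

NOPE_DIM = 128  # non-rotary ("nope") part of the QK head dim
ROPE_DIM = 64   # rotary ("rope") part of the QK head dim


def qk_score(q_nope, q_rope, k_nope, k_rope, scale):
    """Hypothetical helper: score = (q_nope . k_nope + q_rope . k_rope) * scale."""
    s = sum(a * b for a, b in zip(q_nope, k_nope))
    s += sum(a * b for a, b in zip(q_rope, k_rope))
    return s * scale


def attend_row(q_idx, q_nope, q_rope, keys, values, block=4, causal=True):
    """Hypothetical FlashAttention-2-style online softmax for one query row.

    keys:   list of (k_nope, k_rope) pairs, one per key position
    values: list of value vectors, one per key position
    block:  keys processed per step (stands in for a KV tile)
    """
    scale = 1.0 / math.sqrt(NOPE_DIM + ROPE_DIM)  # assumed softmax scale
    m = -math.inf                 # running row max
    l = 0.0                       # running softmax denominator
    acc = [0.0] * len(values[0])  # unnormalized P @ V accumulator

    for start in range(0, len(keys), block):
        stop = min(start + block, len(keys))
        scores = []
        for j in range(start, stop):
            if causal and j > q_idx:
                scores.append(-math.inf)  # causal mask: key is in the future
            else:
                k_nope, k_rope = keys[j]
                scores.append(qk_score(q_nope, q_rope, k_nope, k_rope, scale))

        m_new = max(m, max(scores))
        if m_new == -math.inf:
            continue  # every score so far is masked; nothing to accumulate

        alpha = math.exp(m - m_new)  # rescales the previous l and acc
        p = [math.exp(s - m_new) for s in scores]  # masked entries become 0.0
        l = l * alpha + sum(p)
        for d in range(len(acc)):
            pv = sum(p[i] * values[start + i][d] for i in range(len(p)))
            acc[d] = acc[d] * alpha + pv
        m = m_new

    return [a / l for a in acc]  # final normalize before the store
```

Because the running max `m` and denominator `l` are rescaled by `alpha` at each block, the result equals a full softmax over the unmasked keys regardless of `block`; FlashAttention-2 applies the same rescaling per KV tile, with MMAs and registers in place of Python lists.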

MlaPrefillV2Core's _FP32_SOFTMAX_SCORES gate is on by default whenever the configuration is FP8 with KV >= 128 and the 32x32x64 MMA shape, so every reused primitive exercises the same codegen that MlaPrefillV2 ships.
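The gate is a conjunction of three conditions. A minimal Python sketch of that predicate follows; `HypotheticalConfig`, its fields, and `fp32_softmax_scores` are illustrative names, not the Mojo API, and whether "KV" refers to a KV tile size or a KV length is an assumption left open here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HypotheticalConfig:
    """Hypothetical stand-in for the MlaPrefillV2Core config parameter."""
    is_fp8: bool                      # FP8 operand dtype
    kv: int                           # the KV quantity the gate checks (assumed meaning)
    mma_shape: Tuple[int, int, int]   # (M, N, K) of the MMA instruction


def fp32_softmax_scores(cfg: HypotheticalConfig) -> bool:
    """True only when all three documented conditions hold: FP8, KV >= 128, 32x32x64."""
    return cfg.is_fp8 and cfg.kv >= 128 and cfg.mma_shape == (32, 32, 64)
```

Any configuration that misses one of the three conditions falls outside this default-on path.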

Structs