Mojo module

mxfp4_moe_matmul_amd

MXFP4 x MXFP4 routed MoE matmul kernel for AMD CDNA4.

MXFP4MoERoutedMatmul / mxfp4_moe_matmul_amd_routed implements the full routed MoE matmul. block_idx.y walks the per-expert sort blocks; for each row of a block, the kernel decodes sorted_token_ids to gather the matching row of A from the original token order, then scatters the output row to c[t*topk + s, :], where t is the token index and s is that token's top-k slot. It is a drop-in replacement for the gather + grouped-matmul + scatter pipeline.
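The gather, matmul, and scatter steps described above can be sketched as a plain Python reference. This is only an illustration of the indexing, not the kernel's implementation: it works on already-dequantized floats, and the names `routed_moe_matmul_reference`, `block_expert`, `block_m`, and `pack`, along with the bit packing of `sorted_token_ids` (token index in the low 24 bits, top-k slot in the high bits), are assumptions made for the example and are not confirmed by this page.

```python
def pack(t, s):
    # Hypothetical encoding of one sorted_token_ids entry (an assumption):
    # token index in the low 24 bits, top-k slot above it.
    return (s << 24) | t


def routed_moe_matmul_reference(a, b, sorted_token_ids, block_expert, block_m, topk):
    """Reference semantics: gather rows of A, multiply, scatter to C.

    a: [num_tokens][K] floats, b: [num_experts][K][N] floats.
    block_expert[i] is the expert that owns sort block i (hypothetical name).
    """
    num_tokens, k, n = len(a), len(a[0]), len(b[0][0])
    c = [[0.0] * n for _ in range(num_tokens * topk)]
    for block, expert in enumerate(block_expert):  # plays the role of block_idx.y
        for row in range(block_m):
            packed = sorted_token_ids[block * block_m + row]
            t, s = packed & 0xFFFFFF, packed >> 24
            if t >= num_tokens:
                continue  # padding row: nothing to gather or scatter
            for j in range(n):
                # Gather A by original token index, scatter to row t*topk + s.
                c[t * topk + s][j] = sum(a[t][kk] * b[expert][kk][j] for kk in range(k))
    return c


# Two tokens, two experts, topk = 2, one sort block of 3 rows per expert.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[[1.0], [0.0]], [[0.0], [1.0]]]  # expert 0 picks column 0, expert 1 column 1
pad = pack(len(a), 0)                 # out-of-range token index marks padding
ids = [pack(0, 0), pack(1, 1), pad,   # rows routed to expert 0
       pack(0, 1), pack(1, 0), pad]   # rows routed to expert 1
c = routed_moe_matmul_reference(a, b, ids, block_expert=[0, 1], block_m=3, topk=2)
```

Because each (t, s) pair appears in exactly one sort block, every output row is written once, which is why no separate scatter or reduction pass is needed in this sketch.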

Data layouts:

A: [num_tokens, K_BYTES] uint8, FP4 values packed two per byte, row-major.
B: 5D-preshuffled (see mxfp4_preshuffle_layouts.b_5d_grouped_layout).
sfa, sfb: 4D-preshuffled E8M0 scale bytes (see scale_4d_grouped_layout).
C: [num_tokens * topk, N] fp32, row-major.
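To make the packed FP4 and E8M0 encodings concrete, here is a minimal dequantization sketch in Python. The E2M1 value table, the E8M0 bias of 127, and the 32-element scale block follow the OCP Microscaling (MX) format definition. The nibble order (low nibble first) is an assumption, and the scales here are indexed in plain row-major order, not in the preshuffled 4D layout this module actually consumes. All function names are hypothetical.

```python
# Magnitudes of the eight E2M1 (FP4) codes; bit 3 of the nibble is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
MX_BLOCK = 32  # FP4 elements that share one E8M0 scale


def decode_fp4(nibble):
    """Decode one 4-bit E2M1 code to a float."""
    mag = E2M1_MAGNITUDES[nibble & 0x7]
    return -mag if nibble & 0x8 else mag


def decode_e8m0(byte):
    """Decode an E8M0 scale byte: 2**(byte - 127), with 0xFF reserved for NaN."""
    return float("nan") if byte == 0xFF else 2.0 ** (byte - 127)


def dequantize_row(packed, scales):
    """Expand K_BYTES packed bytes into 2 * K_BYTES floats.

    Assumes the low nibble holds the even element (an assumption) and that
    scales[i] covers elements [32 * i, 32 * i + 32) in unshuffled order.
    """
    out = []
    for i, byte in enumerate(packed):
        scale = decode_e8m0(scales[(2 * i) // MX_BLOCK])
        out.append(decode_fp4(byte & 0xF) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out


row = dequantize_row(bytes([0x21, 0xF7]), scales=[128])  # scale = 2**1
```

Under these assumptions, a row of K logical FP4 elements occupies K / 2 bytes of A and K / 32 scale bytes.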

For the MFMA scale convention, see section 7.2.1 of the AMD CDNA4 ISA: each lane supplies its scale as an i32, and the OPSEL-selected byte of that value is applied to the lane's (M = lane % 16, K-group = lane / 16) slot.
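The lane-to-slot mapping and the byte selection can be written out as a small Python sketch. The 64-lane wavefront size and the byte numbering (OPSEL 0 selecting the least significant byte) are assumptions for illustration; the ISA section cited above is the authority on the actual encoding. The function names are hypothetical.

```python
WAVE_SIZE = 64  # lanes per wavefront (assumed)


def lane_slot(lane):
    """Return the (M, K-group) slot a lane's scale applies to."""
    return lane % 16, lane // 16


def select_scale_byte(scale_i32, opsel):
    """Pick one of the four bytes of a lane's 32-bit scale value.

    Assumes opsel = 0 selects bits 7:0 and opsel = 3 selects bits 31:24.
    """
    return (scale_i32 >> (8 * opsel)) & 0xFF


# Every lane maps to a distinct slot: 16 values of M times 4 K-groups.
slots = {lane_slot(lane) for lane in range(WAVE_SIZE)}
```

With 64 lanes, the mapping covers 16 rows of M and 4 K-groups exactly once each, so a single i32 per lane can carry up to four candidate scale bytes for its slot.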

Structs

Functions