Mojo module

mxfp4_moe_matmul_amd

MXFP4 x MXFP4 routed MoE matmul kernel for AMD CDNA4.

MXFP4MoERoutedMatmul / mxfp4_moe_matmul_amd_routed is the full routed MoE matmul. block_idx.y indexes the per-expert sort blocks. For each row of a block, the kernel decodes sorted_token_ids to gather the A row from the original token order, then scatters the output to c[t*topk + s, :]. It is a drop-in replacement for the gather + grouped-matmul + scatter pipeline.
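The gather, matmul, and scatter indexing described above can be sketched as a scalar reference in plain Python. This is only an illustration under stated assumptions, not the kernel's implementation: the bit packing of sorted_token_ids (token index in the low 24 bits, top-k slot in the high 8 bits), the padding sentinel, and the names routed_moe_matmul_ref, block_expert, and block_m are all hypothetical, and ordinary floats stand in for MXFP4 data.

```python
# Scalar reference for the routed MoE indexing. Floats replace MXFP4 values.
TOKEN_BITS = 24  # ASSUMPTION: low 24 bits = token index t, high bits = top-k slot s


def pack_id(t, s):
    """Pack (token, slot) into one id using the assumed layout."""
    return (s << TOKEN_BITS) | t


def decode_id(packed):
    """Inverse of pack_id: returns (t, s)."""
    return packed & ((1 << TOKEN_BITS) - 1), packed >> TOKEN_BITS


def routed_moe_matmul_ref(a, b, sorted_token_ids, block_expert, block_m, topk):
    """a: [num_tokens][K]; b: [num_experts][N][K]; returns c: [num_tokens*topk][N].

    block_expert[i] is the expert that owns sort block i (hypothetical name).
    """
    num_tokens = len(a)
    n = len(b[0])
    c = [[0.0] * n for _ in range(num_tokens * topk)]
    for blk, expert in enumerate(block_expert):  # plays the role of block_idx.y
        for row in range(block_m):
            t, s = decode_id(sorted_token_ids[blk * block_m + row])
            if t >= num_tokens:  # ASSUMPTION: padding rows carry an out-of-range token
                continue
            a_row = a[t]            # gather from original token order
            out = c[t * topk + s]   # scatter target row
            for j in range(n):
                out[j] = sum(x * w for x, w in zip(a_row, b[expert][j]))
    return c
```

Because each (t, s) pair appears at most once across the sort blocks, every output row is written by exactly one block, so no accumulation across experts is needed at this stage.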

Data layouts:

A: [num_tokens, K_BYTES] uint8, FP4 packed two values per byte, row-major.

B: 5D-preshuffled (see mxfp4_preshuffle_layouts.b_5d_grouped_layout).

sfa, sfb: 4D-preshuffled E8M0 scale bytes (see scale_4d_grouped_layout).

C: [num_tokens * topk, N] fp32, row-major.
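To make the element encodings concrete, here is a minimal Python sketch that dequantizes one packed row. The FP4 (E2M1) value table, the E8M0 decoding (2^(byte - 127), with 0xFF reserved for NaN), and the 32-element scale block follow the OCP Microscaling format. The nibble order (low nibble first) is an assumption, and the scales here are in plain row-major order, not the preshuffled 4D layout the kernel actually consumes. The function names are hypothetical.

```python
# Magnitudes for the 3 low bits of an E2M1 nibble; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
MX_BLOCK = 32  # elements sharing one E8M0 scale in MXFP4


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 value to a Python float."""
    mag = E2M1_MAGNITUDES[nibble & 0x7]
    return -mag if nibble & 0x8 else mag


def decode_e8m0(byte):
    """Decode an E8M0 scale byte: a pure power of two, 0xFF meaning NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def dequant_row(packed, scales):
    """packed: K/2 bytes of FP4 pairs; scales: K/32 E8M0 bytes (unshuffled)."""
    out = []
    for i, byte in enumerate(packed):
        scale = decode_e8m0(scales[(2 * i) // MX_BLOCK])
        # ASSUMPTION: the low nibble holds the even-indexed element.
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

For example, 16 bytes of 0x21 with a scale byte of 128 decode to 32 values alternating between 1.0 and 2.0, since the nibbles 1 and 2 encode 0.5 and 1.0 and the scale is 2^1.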

For the MFMA scale convention (each lane supplies an i32 of scale bytes, and the OPSEL-selected byte is applied to the lane's (M = lane % 16, K-group = lane / 16) slot), see section 7.2.1 of the AMD CDNA4 ISA.
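The lane-to-slot mapping in that convention can be written out as a small Python sketch. The M and K-group formulas come from the text above; the byte numbering for OPSEL (0 selecting the least-significant byte) and the function names are assumptions made for illustration, so the ISA document remains the authority.

```python
WAVE_SIZE = 64  # lanes per wavefront on CDNA


def lane_scale_slot(lane):
    """Return (M row, K-group) of the scale slot owned by a lane."""
    return lane % 16, lane // 16


def select_scale_byte(scale_i32, opsel):
    """Pick one E8M0 byte out of the lane's 32-bit scale word.

    ASSUMPTION: opsel 0 selects the least-significant byte.
    """
    return (scale_i32 >> (8 * opsel)) & 0xFF


# The 64 lanes tile a 16 x 4 grid of (M, K-group) slots exactly once.
ALL_SLOTS = {lane_scale_slot(lane) for lane in range(WAVE_SIZE)}
```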

Structs

Functions