Mojo module
mxfp4_moe_matmul_amd
MXFP4 x MXFP4 routed MoE matmul kernel for AMD CDNA4.
MXFP4MoERoutedMatmul / mxfp4_moe_matmul_amd_routed is the full
routed MoE matmul. The block_idx.y grid dimension walks the per-expert
sort blocks. For each row, the kernel decodes sorted_token_ids to gather
A in original token order and scatters the output row to
c[t*topk + s, :], where t is the token index and s is the top-k slot.
It is a drop-in replacement for the gather + grouped-matmul + scatter
pipeline.
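The gather and scatter semantics described above can be modeled in a few lines of plain Python. This is only a reference sketch of what the routed matmul computes, not the kernel: it works on unquantized floats, it assumes each sorted entry has already been decoded into a hypothetical (t, s, e) triple (token index, top-k slot, expert id), and it assumes a simple [expert][N][K] layout for B. The actual packing of sorted_token_ids and the preshuffled layouts are not reproduced here.

```python
def routed_moe_matmul_reference(a, b, sorted_entries, topk, n):
    """Plain-Python model of the routed MoE matmul.

    a: [num_tokens][K] floats (already dequantized).
    b: [num_experts][N][K] floats (assumed layout, not the 5D preshuffle).
    sorted_entries: iterable of (t, s, e) triples, an assumed decoded
        form of sorted_token_ids grouped by expert.
    Returns c with shape [num_tokens * topk][N].
    """
    num_tokens = len(a)
    c = [[0.0] * n for _ in range(num_tokens * topk)]
    for t, s, e in sorted_entries:
        row = a[t]              # gather A from original token order
        out = c[t * topk + s]   # scatter target row in C
        for j in range(n):
            out[j] = sum(x * w for x, w in zip(row, b[e][j]))
    return c
```

Because the gather and scatter are folded into the row loop, no intermediate gathered copy of A or permuted copy of C is materialized, which is the point of replacing the three-stage pipeline.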
Data layouts:
A: [num_tokens, K_BYTES] uint8, FP4 packed two-per-byte, row-major.
B: 5D-preshuffled (see mxfp4_preshuffle_layouts.b_5d_grouped_layout).
sfa, sfb: 4D-preshuffled E8M0 scale bytes (scale_4d_grouped_layout).
C: [num_tokens * topk, N] fp32, row-major.
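To make the packed formats concrete, here is a small Python sketch of dequantizing one row of A. The E2M1 value table, the E8M0 decoding (a biased power-of-two exponent, with 0xFF reserved for NaN), and the 32-element scale block come from the OCP Microscaling formats. Two things are assumptions for illustration: the low nibble is taken to hold the even-indexed element, and the scales are read from a plain row-major list rather than the 4D-preshuffled layout the kernel actually uses.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
MX_BLOCK = 32  # elements that share one E8M0 scale in the MX formats


def decode_fp4(nibble):
    """Decode one 4-bit E2M1 value to a Python float."""
    mag = FP4_E2M1_MAGNITUDES[nibble & 0x7]
    return -mag if nibble & 0x8 else mag


def decode_e8m0(byte):
    """Decode an E8M0 scale byte: 2**(byte - 127), with 0xFF as NaN."""
    return float("nan") if byte == 0xFF else 2.0 ** (byte - 127)


def dequantize_row(packed, scales):
    """packed: K/2 bytes of FP4 pairs; scales: one E8M0 byte per 32 elements."""
    out = []
    for i, byte in enumerate(packed):
        scale = decode_e8m0(scales[(2 * i) // MX_BLOCK])
        # Assumption: the low nibble holds the even-indexed element.
        out.append(decode_fp4(byte & 0xF) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out
```

With this packing, a row of K logical elements occupies K/2 bytes, which is what K_BYTES denotes in the A layout above.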
For the MFMA scale convention (per-lane scale i32, OPSEL-selected byte
applied to the lane's (M=lane%16, K-group=lane/16) slot) see the AMD
CDNA4 ISA section 7.2.1.
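As a rough illustration of the lane-to-slot mapping named above, the following Python sketch computes the (M, K-group) slot for a lane and extracts one scale byte from a per-lane 32-bit word. The helper names are hypothetical, and the choice that an OPSEL value of 0 selects the least-significant byte is an assumption; the ISA document is the authority on the actual encoding.

```python
WAVE_SIZE = 64  # lanes per wavefront on CDNA


def lane_slot(lane):
    """Map a lane id to the (M row, K-group) slot its scale applies to."""
    if not 0 <= lane < WAVE_SIZE:
        raise ValueError("lane must be in [0, 64)")
    return lane % 16, lane // 16


def select_scale_byte(scale_i32, opsel):
    """Pick one of the four E8M0 bytes packed in a per-lane 32-bit word."""
    if not 0 <= opsel < 4:
        raise ValueError("opsel must be in [0, 4)")
    # Assumption: OPSEL value 0 selects the least-significant byte.
    return (scale_i32 >> (8 * opsel)) & 0xFF
```

Under this mapping the 64 lanes cover 16 M rows by 4 K-groups, and packing four scale bytes per lane lets one register load serve several MFMA issues that differ only in OPSEL.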
Structs
- InputRowMode: Selects how the kernel decodes A's row index from sorted_token_ids.
- MXFP4MoERoutedMatmul
Functions
- mxfp4_moe_matmul_amd_routed: Launches the routed MXFP4xMXFP4 matmul on AMD CDNA4.
- mxfp4_moe_matmul_amd_routed_dispatch: Dispatches the routed kernel to a tile shape based on max_tokens_per_expert.