
Mojo module

mxfp4_moe_matmul_amd

MXFP4 x MXFP4 routed MoE matmul kernel for AMD CDNA4.

MXFP4MoERoutedMatmul / mxfp4_moe_matmul_amd_routed implements the full routed MoE matmul. Each value of block_idx.y selects one per-expert sort block. For every row in that block, the kernel decodes sorted_token_ids to gather the matching row of A from the original token order, then scatters the output row to c[t*topk + s, :]. This makes it a drop-in replacement for the separate gather + grouped-matmul + scatter pipeline.
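The gather, matmul, and scatter that the routed kernel fuses can be sketched in plain Python on unquantized floats. This is a reference for the indexing only, not the kernel's implementation. Several things here are assumptions that the page does not state: each entry of sorted_token_ids is taken to be the flat index t*topk + s (how the kernel actually decodes it depends on InputRowMode), padding entries are taken to be any value >= num_tokens*topk, and the names expert_ids, block_m, and the [num_experts][N][K] weight shape are hypothetical.

```python
def routed_moe_matmul_reference(a, b, sorted_token_ids, expert_ids, topk, block_m):
    """Reference routed MoE matmul on plain floats (no MXFP4, no preshuffle).

    a: [num_tokens][K] activations in original token order.
    b: [num_experts][N][K] weights (hypothetical logical shape).
    sorted_token_ids: flat ids grouped into sort blocks of block_m rows.
    expert_ids: expert_ids[block] is the expert that owns that sort block.
    Returns c: [num_tokens * topk][N].
    """
    num_tokens = len(a)
    n = len(b[0])
    c = [[0.0] * n for _ in range(num_tokens * topk)]
    for block, expert in enumerate(expert_ids):  # plays the role of block_idx.y
        rows = sorted_token_ids[block * block_m:(block + 1) * block_m]
        for flat in rows:
            if flat >= num_tokens * topk:  # ASSUMPTION: padding sentinel
                continue
            t, s = divmod(flat, topk)  # ASSUMPTION: flat id is t*topk + s
            a_row = a[t]  # gather from original token order
            # One output row: dot product of the token with each weight row.
            c[t * topk + s] = [
                sum(x * w for x, w in zip(a_row, w_row)) for w_row in b[expert]
            ]
    return c
```

With two tokens, topk = 2, and two experts, a call such as routed_moe_matmul_reference(a, b, [0, 3, 4, 4, 1, 2, 4, 4], [0, 1], topk=2, block_m=4) routes flat rows 0 and 3 through expert 0 and rows 1 and 2 through expert 1, skipping the padding entries.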

Data layouts:

A: [num_tokens, K_BYTES] uint8, FP4 packed two per byte, row-major.
B: 5D-preshuffled (see mxfp4_preshuffle_layouts.b_5d_grouped_layout).
sfa, sfb: 4D-preshuffled E8M0 scale bytes (scale_4d_grouped_layout).
C: [num_tokens * topk, N] fp32, row-major.
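To make the element encoding concrete, here is a minimal sketch of how one row of packed FP4 bytes and its E8M0 scale bytes map back to real values, following the OCP Microscaling (MX) format: E2M1 elements, one shared power-of-two scale per 32 elements. It works on a logical row-major view and ignores the preshuffled B and scale layouts entirely. The nibble order (low nibble holds the even-indexed element) is an assumption, and the function names are hypothetical.

```python
import math

# E2M1 magnitudes indexed by the low three bits of a nibble (OCP MX format).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
MX_BLOCK = 32  # number of elements that share one E8M0 scale in MXFP4


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 value: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e8m0(byte):
    """Decode an E8M0 scale byte: 2**(byte - 127), with 0xFF reserved for NaN."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


def dequantize_row(packed, scales):
    """Expand K_BYTES packed bytes into 2 * K_BYTES scaled float values."""
    out = []
    for i, byte in enumerate(packed):
        lo, hi = byte & 0xF, byte >> 4  # ASSUMPTION: low nibble is element 2*i
        for j, nibble in enumerate((lo, hi)):
            k = 2 * i + j
            out.append(decode_e2m1(nibble) * decode_e8m0(scales[k // MX_BLOCK]))
    return out
```

For example, dequantize_row(bytes([0x21] * 16), [128]) expands 16 bytes into 32 values that share a single scale of 2.0.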

For the MFMA scale convention, see section 7.2.1 of the AMD CDNA4 ISA: each lane supplies an i32 of scale bytes, and the OPSEL-selected byte is applied to that lane's (M = lane % 16, K-group = lane / 16) slot.
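The lane-to-slot mapping above can be written out as a small helper. This is an illustration of the arithmetic only. The function name is hypothetical, and the byte order (OPSEL 0 selecting the least significant byte of the i32) is an assumption; the ISA section is the authority on the actual encoding.

```python
def lane_scale_slot(lane, scale_i32, opsel):
    """Return (m, k_group, scale_byte) for one lane of a 64-lane wavefront."""
    assert 0 <= lane < 64 and 0 <= opsel < 4
    m = lane % 16         # row within the 16-row M tile
    k_group = lane // 16  # which of the four K groups this lane covers
    # ASSUMPTION: OPSEL counts bytes from the least significant end.
    scale_byte = (scale_i32 >> (8 * opsel)) & 0xFF
    return m, k_group, scale_byte
```

For instance, lane 37 maps to M = 5 and K-group = 2, so whichever scale byte OPSEL picks from that lane's i32 applies to that slot.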

Structs

InputRowMode: Selects how the kernel decodes A's row index from sorted_token_ids.
MXFP4MoERoutedMatmul:

Functions

mxfp4_moe_matmul_amd_routed: Launches the routed MXFP4xMXFP4 matmul on AMD CDNA4.
mxfp4_moe_matmul_amd_routed_dispatch: Dispatches the routed kernel to a tile shape based on max_tokens_per_expert.
