IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.
Skip to main content
For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Mojo module

mxfp4_preshuffle_layouts

Host-side MXFP4 preshuffle layouts for AMD CDNA4 grouped MoE matmul.

Shuffler bundles two layout transforms required by the FP4 MoE matmul kernel. Both run once on the host at weight-load time (per the load-time- prep convention).

Shuffler.preshuffle_b_5d: [E, N, K_BYTES] (row-major, packed FP4) -> flat byte buffer indexed as (E, N0, K0, KLane=4, NLane=16, KPack=16). Each lane's 16-byte MFMA fragment lands at a contiguous DRAM address, so B reads go straight DRAM -> VGPR via buffer_load_dwordx4 with no LDS round-trip.

Shuffler.preshuffle_scale_4d: [E, MN, K_SCALES] (row-major, E8M0 bytes) -> flat byte buffer indexed as (E, MN1, K1, XdlKThread=4, XdlMNThread=16, KXdlPack=2, MNXdlPack=2). One i32 lane-load fetches 4 E8M0 scales packed in (k_pack, mn_pack) order, feeding 4 sub-MMAs via the MFMA opsel byte selectors.

mxfp4_preshuffle_b_5d_gpu: GPU equivalent of Shuffler.preshuffle_b_5d. LDS-staged so both HBM reads and writes are wave-coalesced. Constant-folded by the graph compiler when called via mo.mxfp4.preshuffle.b.5d on a Constant weight, so the shuffle runs once at session.load instead of every forward pass.

Layout reference (canonical): composable_kernel/example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp:73-167 โ€” preShuffleWeight (B 5D) and preShuffleScale (scale 4D).

Structsโ€‹