
Mojo module

mxfp4_preshuffle_layouts

Host-side MXFP4 preshuffle layouts for AMD CDNA4 grouped MoE matmul.

Shuffler bundles two layout transforms required by the FP4 MoE matmul kernel. Both run once on the host at weight-load time (per the load-time-prep convention).

Shuffler.preshuffle_b_5d: [E, N, K_BYTES] (row-major, packed FP4) -> flat byte buffer indexed as (E, N0, K0, KLane=4, NLane=16, KPack=16). Each lane's 16-byte MFMA fragment lands at a contiguous DRAM address, so B reads go straight DRAM -> VGPR via buffer_load_dwordx4 with no LDS round-trip.
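The index arithmetic behind this layout can be sketched in plain Python. This is an illustrative model, not the Mojo implementation: `b_5d_offset` and `preshuffle_b` are hypothetical helper names, and the way `n` and `k_byte` split into lane and pack components (`n = n0 * 16 + n_lane`, `k_byte = (k0 * 4 + k_lane) * 16 + k_pack`) is an assumption inferred from the dimension names. Only the dimension order and sizes come from the description above.

```python
# Dimension sizes taken from the layout description above.
K_LANE = 4    # KLane
N_LANE = 16   # NLane
K_PACK = 16   # KPack: bytes per lane fragment (32 packed FP4 values)


def b_5d_offset(e, n, k_byte, N, K_BYTES):
    """Flat byte offset of source element [e, n, k_byte] in the shuffled buffer.

    ASSUMPTION (not confirmed by the source):
      n      = n0 * N_LANE + n_lane
      k_byte = (k0 * K_LANE + k_lane) * K_PACK + k_pack
    """
    if N % N_LANE or K_BYTES % (K_LANE * K_PACK):
        raise ValueError("N must be a multiple of 16 and K_BYTES a multiple of 64")
    n0, n_lane = divmod(n, N_LANE)
    k0, rem = divmod(k_byte, K_LANE * K_PACK)
    k_lane, k_pack = divmod(rem, K_PACK)
    num_n0 = N // N_LANE
    num_k0 = K_BYTES // (K_LANE * K_PACK)
    # Row-major flattening over (E, N0, K0, KLane, NLane, KPack).
    idx = e
    for size, i in (
        (num_n0, n0),
        (num_k0, k0),
        (K_LANE, k_lane),
        (N_LANE, n_lane),
        (K_PACK, k_pack),
    ):
        idx = idx * size + i
    return idx


def preshuffle_b(src, E, N, K_BYTES):
    """Copy a row-major [E, N, K_BYTES] byte sequence into the shuffled order."""
    dst = bytearray(len(src))
    for e in range(E):
        for n in range(N):
            for k in range(K_BYTES):
                dst[b_5d_offset(e, n, k, N, K_BYTES)] = src[(e * N + n) * K_BYTES + k]
    return bytes(dst)
```

Under these assumptions, with N = 32 and K_BYTES = 128, source element [0, 1, 0] maps to offset 16 and [0, 0, 16] maps to offset 256. The 16 rows of one N tile therefore sit side by side as 16-byte fragments, and the next K lane starts after all 16 of them.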

Shuffler.preshuffle_scale_4d: [E, MN, K_SCALES] (row-major, E8M0 bytes) -> flat byte buffer indexed as (E, MN1, K1, XdlKThread=4, XdlMNThread=16, KXdlPack=2, MNXdlPack=2). One i32 lane-load fetches 4 E8M0 scales packed in (k_pack, mn_pack) order, feeding 4 sub-MMAs via the MFMA opsel byte selectors.
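A small Python sketch shows why the four scales of one lane fit in a single 32-bit load. It flattens the seven-component index in the order listed above and splits a loaded word back into bytes. The helper names `scale_4d_offset` and `unpack_lane_scales` are hypothetical, and the little-endian byte order within the word is an assumption. How a source coordinate `[e, mn, k_scale]` decomposes into these components is not stated here, so the sketch takes the components directly.

```python
import struct

# Dimension sizes taken from the layout description above.
XDL_K_THREAD = 4
XDL_MN_THREAD = 16
K_XDL_PACK = 2
MN_XDL_PACK = 2


def scale_4d_offset(e, mn1, k1, k_thread, mn_thread, k_pack, mn_pack, num_mn1, num_k1):
    """Row-major flat byte offset for the index
    (E, MN1, K1, XdlKThread, XdlMNThread, KXdlPack, MNXdlPack)."""
    idx = e
    for size, i in (
        (num_mn1, mn1),
        (num_k1, k1),
        (XDL_K_THREAD, k_thread),
        (XDL_MN_THREAD, mn_thread),
        (K_XDL_PACK, k_pack),
        (MN_XDL_PACK, mn_pack),
    ):
        if not 0 <= i < size:
            raise IndexError("index component out of range")
        idx = idx * size + i
    return idx


def unpack_lane_scales(word):
    """Split one 32-bit lane load into E8M0 bytes keyed by (k_pack, mn_pack).

    ASSUMPTION: little-endian, so the lowest address is the least
    significant byte of the loaded word.
    """
    raw = struct.pack("<I", word)
    return {
        (kp, mp): raw[kp * MN_XDL_PACK + mp]
        for kp in range(K_XDL_PACK)
        for mp in range(MN_XDL_PACK)
    }
```

Because KXdlPack and MNXdlPack are the two innermost dimensions, the four (k_pack, mn_pack) combinations for a fixed lane occupy four consecutive bytes starting at a 4-byte-aligned offset. For example, `unpack_lane_scales(0x7F807E7D)` yields 0x7D for (0, 0), 0x7E for (0, 1), 0x80 for (1, 0), and 0x7F for (1, 1).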

Layout reference (canonical): composable_kernel/example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp:73-167, functions preShuffleWeight (B 5D) and preShuffleScale (scale 4D).

Structs

  • Shuffler: Host-side MXFP4 preshuffle layouts and helpers for AMD CDNA4.