Mojo module
mxfp4_preshuffle_layouts
Host-side MXFP4 preshuffle layouts for AMD CDNA4 grouped MoE matmul.
Shuffler bundles two layout transforms required by the FP4 MoE matmul
kernel. Both run once on the host at weight-load time (per the load-time-
prep convention).
Shuffler.preshuffle_b_5d:
[E, N, K_BYTES] (row-major, packed FP4) -> flat byte buffer indexed as
(E, N0, K0, KLane=4, NLane=16, KPack=16). Each lane's 16-byte MFMA
fragment lands at a contiguous DRAM address, so B reads go straight
DRAM -> VGPR via buffer_load_dwordx4 with no LDS round-trip.
Shuffler.preshuffle_scale_4d:
[E, MN, K_SCALES] (row-major, E8M0 bytes) -> flat byte buffer indexed
as (E, MN1, K1, XdlKThread=4, XdlMNThread=16, KXdlPack=2, MNXdlPack=2). One i32 lane-load fetches 4 E8M0 scales packed in
(k_pack, mn_pack) order, feeding 4 sub-MMAs via the MFMA opsel byte
selectors.
Layout reference (canonical):
composable_kernel/example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp:73-167
โ preShuffleWeight (B 5D) and preShuffleScale (scale 4D).
Structsโ
- โ
Shuffler: Host-side MXFP4 preshuffle layouts and helpers for AMD CDNA4.
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!