For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Mojo module

mxfp4_preshuffle_layouts

Host-side MXFP4 preshuffle layouts for AMD CDNA4 grouped MoE matmul.

Shuffler bundles two layout transforms required by the FP4 MoE matmul kernel. Both run once on the host at weight-load time (per the load-time- prep convention).

Shuffler.preshuffle_b_5d: [E, N, K_BYTES] (row-major, packed FP4) -> flat byte buffer indexed as (E, N0, K0, KLane=4, NLane=16, KPack=16). Each lane's 16-byte MFMA fragment lands at a contiguous DRAM address, so B reads go straight DRAM -> VGPR via buffer_load_dwordx4 with no LDS round-trip.

Shuffler.preshuffle_scale_4d: [E, MN, K_SCALES] (row-major, E8M0 bytes) -> flat byte buffer indexed as (E, MN1, K1, XdlKThread=4, XdlMNThread=16, KXdlPack=2, MNXdlPack=2). One i32 lane-load fetches 4 E8M0 scales packed in (k_pack, mn_pack) order, feeding 4 sub-MMAs via the MFMA opsel byte selectors.

mxfp4_preshuffle_b_5d_gpu: GPU equivalent of Shuffler.preshuffle_b_5d. LDS-staged so both HBM reads and writes are wave-coalesced. Constant-folded by the graph compiler when called via mo.mxfp4.preshuffle.b.5d on a Constant weight, so the shuffle runs once at session.load instead of every forward pass.

Layout reference (canonical): composable_kernel/example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp:73-167 — preShuffleWeight (B 5D) and preShuffleScale (scale 4D).

Structs

Shuffler: MXFP4 preshuffle layouts and helpers for AMD CDNA4.

Structs​

Structs