
Mojo module

mxfp4_preshuffle_loaders

Per-lane DRAM->VGPR loaders for the preshuffled MXFP4 MoE matmul.

Both loaders consume buffers produced by mxfp4_preshuffle_layouts and emit one buffer_load_* per call, with no LDS round-trip. Each lane reads exactly the fragment / scale word the MFMA needs at its (lane_nlane, lane_klane) slot.

PreshuffledBLoader[N, K_BYTES]: Loads one 16-byte FP4 B fragment per lane via buffer_load_dwordx4, indexed by logical (n, k_byte) through b_5d_layout.
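
To make the fragment size concrete: one buffer_load_dwordx4 returns 4 dwords, i.e. 16 bytes, which hold 32 FP4 values at two per byte. The sketch below is a host-side Python illustration of that arithmetic, not the loader's Mojo implementation. The E2M1 value table follows the OCP MX FP4 format; the nibble order (low nibble first) and the names `decode_fp4` and `unpack_fragment` are assumptions made for illustration, and the real element order is determined by b_5d_layout.

```python
# E2M1 magnitudes indexed by the low 3 bits of an FP4 code (OCP MX FP4 format).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# One buffer_load_dwordx4 returns 4 dwords x 4 bytes = 16 bytes per lane.
FRAGMENT_BYTES = 16


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0..2 select the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_fragment(fragment: bytes) -> list:
    """Unpack a 16-byte fragment into 32 FP4 values.

    Assumption (hypothetical, for illustration): the low nibble of each byte
    holds the earlier element. The real order depends on the preshuffled layout.
    """
    if len(fragment) != FRAGMENT_BYTES:
        raise ValueError(f"expected {FRAGMENT_BYTES} bytes, got {len(fragment)}")
    values = []
    for byte in fragment:
        values.append(decode_fp4(byte & 0xF))  # low nibble
        values.append(decode_fp4(byte >> 4))   # high nibble
    return values
```

For example, `unpack_fragment(bytes([0x21] + [0] * 15))` yields 32 values whose first two are 0.5 and 1.0 under the assumed nibble order.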

PreshuffledScaleLoader[MN_padded, K_SCALES]: Loads one packed Int32 scale word per lane (4 E8M0 bytes covering MNXdlPack=2 x KXdlPack=2 sub-MMAs) via buffer_load_dword, indexed by logical (mn, k_scale) through scale_4d_layout.
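
The packed scale word can be pictured as four E8M0 bytes sharing one 32-bit value, one byte per (MN, K) sub-MMA in the 2 x 2 pack. The following Python sketch illustrates that packing and the E8M0 decoding rule (an 8-bit biased exponent with bias 127, with 0xFF reserved for NaN, per the OCP MX format). It is not the loader's code: the byte order within the word and the helper names `pack_scales`, `unpack_scales`, and `decode_e8m0` are assumptions for illustration, and the actual byte-to-sub-MMA mapping is defined by scale_4d_layout.

```python
E8M0_BIAS = 127   # E8M0 stores a biased power-of-two exponent
E8M0_NAN = 0xFF   # the all-ones encoding is reserved for NaN
SCALES_PER_WORD = 4  # MNXdlPack=2 x KXdlPack=2 sub-MMAs


def decode_e8m0(byte: int) -> float:
    """Decode one E8M0 scale byte to its value 2**(byte - 127)."""
    if byte == E8M0_NAN:
        return float("nan")
    return 2.0 ** (byte - E8M0_BIAS)


def pack_scales(scales) -> int:
    """Pack four E8M0 bytes into one 32-bit word (returned as an unsigned bit pattern).

    Assumption (hypothetical): scales[0] occupies the lowest byte.
    """
    if len(scales) != SCALES_PER_WORD:
        raise ValueError(f"expected {SCALES_PER_WORD} scale bytes, got {len(scales)}")
    word = 0
    for i, byte in enumerate(scales):
        word |= (byte & 0xFF) << (8 * i)
    return word


def unpack_scales(word: int) -> list:
    """Split a 32-bit word back into its four E8M0 bytes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(SCALES_PER_WORD)]
```

Under the assumed byte order, `pack_scales([127, 128, 126, 0])` gives `0x007E807F`, and decoding the first three bytes gives scales of 1.0, 2.0, and 0.5.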

Structs