Mojo module
mxfp4_preshuffle_loaders
Per-lane DRAM->VGPR loaders for the preshuffled MXFP4 MoE matmul.
Both loaders consume buffers produced by mxfp4_preshuffle_layouts and emit
one buffer_load_* per call, with no LDS round-trip. Each lane reads exactly the
fragment / scale word the MFMA needs at its (lane_nlane, lane_klane) slot.
PreshuffledBLoader[N, K_BYTES]:
Loads one 16-byte FP4 B fragment per lane via buffer_load_dwordx4,
indexed by logical (n, k_byte) through b_5d_layout.
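
To make the fragment size concrete: a 16-byte FP4 fragment holds 32 four-bit values. The Python sketch below decodes one such fragment on the host, assuming the OCP MX E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit) and assuming the low nibble of each byte comes first. The nibble order and the helper name `decode_fp4_fragment` are illustrative assumptions, not part of this module, and the sketch does not model the `b_5d_layout` addressing.

```python
# Magnitudes of the 8 E2M1 codes, indexed by the low 3 bits of a nibble.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_fp4_fragment(fragment: bytes) -> list[float]:
    """Decode one 16-byte fragment into 32 FP4 (E2M1) values.

    Hypothetical helper: assumes the low nibble of each byte is the
    first element and the high nibble is the second.
    """
    if len(fragment) != 16:
        raise ValueError("expected a 16-byte fragment")
    values = []
    for byte in fragment:
        for nibble in (byte & 0xF, byte >> 4):
            sign = -1.0 if nibble & 0x8 else 1.0  # bit 3 is the sign bit
            values.append(sign * E2M1_MAGNITUDES[nibble & 0x7])
    return values
```

Each lane's single `buffer_load_dwordx4` therefore delivers 32 FP4 elements along K for one row of B.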
PreshuffledScaleLoader[MN_padded, K_SCALES]:
Loads one packed Int32 scale word per lane (4 E8M0 bytes covering
MNXdlPack=2 x KXdlPack=2 sub-MMAs) via buffer_load_dword, indexed
by logical (mn, k_scale) through scale_4d_layout.
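
As a rough illustration of what one packed scale word contains, the Python sketch below splits a 32-bit word into its four E8M0 bytes and converts each to a scale factor. It assumes the OCP MX E8M0 convention (value 2^(e - 127), with 0xFF reserved for NaN) and assumes byte 0 sits in the least significant bits. The byte order, the mapping of bytes to the MNXdlPack x KXdlPack sub-MMAs, and the helper name `unpack_scale_word` are assumptions, not taken from this module.

```python
import math

E8M0_BIAS = 127  # exponent bias of the E8M0 scale format


def unpack_scale_word(word: int) -> list[float]:
    """Split one packed Int32 into four E8M0 scale factors.

    Hypothetical helper: assumes byte i occupies bits [8*i, 8*i + 8).
    """
    word &= 0xFFFFFFFF  # reinterpret a signed Int32 as raw bits
    scales = []
    for i in range(4):
        exponent = (word >> (8 * i)) & 0xFF
        if exponent == 0xFF:
            scales.append(math.nan)  # 0xFF encodes NaN in E8M0
        else:
            scales.append(2.0 ** (exponent - E8M0_BIAS))
    return scales
```

With MNXdlPack=2 and KXdlPack=2, the four bytes cover 2 x 2 sub-MMAs, so a single `buffer_load_dword` supplies every scale the lane needs for that group.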
Structs
- PreshuffledBLoader: Per-lane B fragment loader from preshuffled GMEM (DRAM -> VGPR direct).
- PreshuffledScaleLoader: Per-lane packed-Int32 scale loader from preshuffled GMEM.