
Mojo module

mxfp4_preshuffle_loaders

Per-lane DRAM->VGPR loaders for the preshuffled MXFP4 MoE matmul.

Both loaders consume buffers produced by mxfp4_preshuffle_layouts and emit one buffer_load_* per call, with no LDS round-trip. Each lane reads exactly the fragment / scale word the MFMA needs at its (lane_nlane, lane_klane) slot.

PreshuffledBLoader[N, K_BYTES]: Loads one 16-byte FP4 B fragment per lane via buffer_load_dwordx4, indexed by logical (n, k_byte) through b_5d_layout.

PreshuffledScaleLoader[MN_padded, K_SCALES]: Loads one packed Int32 scale word per lane (4 E8M0 bytes covering MNXdlPack=2 x KXdlPack=2 sub-MMAs) via buffer_load_dword, indexed by logical (mn, k_scale) through scale_4d_layout.
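To make the per-lane payload sizes concrete, here is a minimal host-side sketch in Python of what one lane's loaded data contains. It is not this module's API: the function names `unpack_scale_word` and `decode_fp4_fragment` are hypothetical, and the byte order inside the scale word and the nibble order inside each fragment byte (little-endian, low nibble first) are assumptions. The E8M0 and E2M1 value encodings follow the OCP Microscaling formats. How the four scale bytes map onto the 2 x 2 sub-MMAs is decided by scale_4d_layout and is not modeled here.

```python
import struct

# FP4 (E2M1) magnitudes selected by the low 3 bits of a nibble.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def unpack_scale_word(word: int) -> list[float]:
    """Split one 32-bit scale word into 4 E8M0 scales.

    ASSUMPTION: bytes are taken in little-endian order.
    """
    raw = struct.pack("<I", word & 0xFFFFFFFF)
    # E8M0 is a bare biased exponent: value = 2**(e - 127); 0xFF encodes NaN.
    return [float("nan") if e == 0xFF else 2.0 ** (e - 127) for e in raw]


def decode_fp4_fragment(fragment: bytes) -> list[float]:
    """Decode one 16-byte fragment into 32 FP4 values.

    ASSUMPTION: the low nibble of each byte comes first.
    """
    if len(fragment) != 16:
        raise ValueError("a dwordx4 load returns exactly 16 bytes")
    values = []
    for byte in fragment:
        for nibble in (byte & 0xF, byte >> 4):
            sign = -1.0 if nibble & 0x8 else 1.0  # bit 3 is the sign bit
            values.append(sign * E2M1_MAGNITUDES[nibble & 0x7])
    return values
```

Under these assumptions, one buffer_load_dwordx4 result (4 x 32 bits = 16 bytes) holds 32 FP4 elements, and one buffer_load_dword result holds 4 power-of-two scales.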

Structs