
Mojo module

mxfp4_matmul_amd_preb

MXFP4 block-scaled matmul on AMD CDNA4 with preshuffled B + scales + direct VGPR loads.

Variant of MXFP4MatmulAMD that skips LDS staging for both B and the A/B scales. B is preshuffled host-side via Shuffler.preshuffle_b_5d so each lane's 16-byte fragment lives at a known DRAM offset and is read with a single buffer_load_dwordx4. Scales are addressed by Shuffler.scale_4d_byte_off: each lane reads one Int32 covering a (mn_pack=2, k_pack=2) cell that feeds 4 sub-MMAs via the MFMA's OPSEL byte selector.
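To make the packed-scale idea concrete, here is a minimal host-side sketch in plain Python (not Mojo, and not the module's implementation). It packs a 2x2 cell of one-byte scales into a 32-bit word and picks one byte back out, the way an OPSEL-style byte selector would. The function names are hypothetical. The byte order (byte index = 2*k + mn) is an assumption inferred from the BM=16 note in the tile constraints; the real layout is whatever Shuffler.scale_4d_byte_off defines. The E8M0 decoding (2^(byte - 127), with 0xFF meaning NaN) follows the OCP MX scale format.

```python
def pack_scale_cell(scales):
    """Pack a 2x2 cell of scale bytes into one 32-bit word.

    scales[k][mn] is the byte for k-half `k` and M/N-half `mn`.
    ASSUMPTION: byte index = 2*k + mn. The real layout is defined by
    Shuffler.scale_4d_byte_off and may differ.
    """
    word = 0
    for k in range(2):
        for mn in range(2):
            byte = scales[k][mn]
            if not 0 <= byte <= 0xFF:
                raise ValueError("each scale must fit in one byte")
            word |= byte << (8 * (2 * k + mn))
    return word


def select_scale_byte(word, opsel):
    """Return byte `opsel` (0..3) of a 32-bit word, like a byte selector."""
    if not 0 <= opsel <= 3:
        raise ValueError("opsel must be in 0..3")
    return (word >> (8 * opsel)) & 0xFF


def e8m0_to_float(byte):
    """Decode an E8M0 scale byte: 2**(byte - 127); 0xFF encodes NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)
```

Under the assumed layout, a logical right shift of the word by 8 bits moves the two mn=1 bytes (indices 1 and 3) into positions 0 and 2, so a selector that can only address bytes 0 and 2 still reaches them.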

Only suitable when num_warps_m == 1 (BM == WM); otherwise B would be read multiple times across the warps in the M direction without LDS reuse.

Tile constraints:

  • BM == 16 or BM % 32 == 0. BM=16 uses one sub-MMA per CTA along M; the scale i32's mn_pack=1 byte is rotated into OPSEL byte 0/2 with shrui (see BlockScaledMmaOp_PreB.mma).
  • WN == 16 or WN % 32 == 0. Same logic per-warp along N.
  • num_k_mmas must be even (k_pack=2 cell halves).
  • N must be a multiple of 32 (= 16 * mn_pack) for B-scale cell alignment.
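The constraints above, together with the num_warps_m == 1 requirement, can be checked up front before launching a kernel. The following is a minimal sketch in plain Python, assuming the caller already knows BM, WM, WN, N, and num_k_mmas; the function name `check_preb_tiles` is hypothetical and not part of the module.

```python
def check_preb_tiles(BM, WM, WN, N, num_k_mmas):
    """Return a list of violated constraints; an empty list means OK.

    Hypothetical helper mirroring the documented tile constraints.
    """
    problems = []
    # One warp along M: otherwise B is re-read per warp with no LDS reuse.
    if BM != WM:
        problems.append("num_warps_m must be 1 (BM == WM)")
    if not (BM == 16 or BM % 32 == 0):
        problems.append("BM must be 16 or a multiple of 32")
    if not (WN == 16 or WN % 32 == 0):
        problems.append("WN must be 16 or a multiple of 32")
    # Each scale cell spans k_pack=2 halves along K.
    if num_k_mmas % 2 != 0:
        problems.append("num_k_mmas must be even")
    # 32 = 16 * mn_pack, needed for B-scale cell alignment.
    if N % 32 != 0:
        problems.append("N must be a multiple of 32")
    return problems
```

For example, `check_preb_tiles(BM=48, WM=48, WN=32, N=4096, num_k_mmas=4)` reports only the BM violation, since 48 is neither 16 nor a multiple of 32.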

Structs

Functions