Mojo module
mxfp4_matmul_amd_preb
MXFP4 block-scaled matmul on AMD CDNA4 with preshuffled B + scales + direct VGPR loads.
Variant of MXFP4MatmulAMD that skips LDS staging for both B and the
A/B scales. B is preshuffled host-side via Shuffler.preshuffle_b_5d
so each lane's 16-byte fragment lives at a known DRAM offset and is
read with a single buffer_load_dwordx4. Scales are addressed by
Shuffler.scale_4d_byte_off: each lane reads one Int32 covering a
(mn_pack=2, k_pack=2) cell that feeds 4 sub-MMAs via the MFMA's
OPSEL byte selector.
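The byte-selection idea can be illustrated in plain Python (not Mojo). This is only a sketch of picking one of four scale bytes out of a 32-bit word; the function names, the example value, and the least-significant-first byte numbering are assumptions for illustration, not the kernel's actual layout or API.

```python
def select_scale_byte(packed: int, opsel: int) -> int:
    """Return byte `opsel` (0..3, least significant first) of a 32-bit word."""
    if not 0 <= opsel <= 3:
        raise ValueError("opsel must be in 0..3")
    return ((packed & 0xFFFFFFFF) >> (8 * opsel)) & 0xFF


def shift_right_logical_8(packed: int) -> int:
    """Logical right shift by one byte: bytes 1 and 3 land in bytes 0 and 2."""
    return (packed & 0xFFFFFFFF) >> 8


# Hypothetical packed word holding four one-byte scales.
packed = 0x7E7F8081

# One 32-bit load yields four selectable scale bytes.
scales = [select_scale_byte(packed, i) for i in range(4)]

# After the shift, selectors 0 and 2 see what used to be bytes 1 and 3.
shifted = shift_right_logical_8(packed)
```

Under these assumptions, a single load serves four sub-MMAs because each one only changes the selector, not the register.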
This variant is only suitable when num_warps_m == 1 (BM == WM); otherwise B would be
read multiple times across the warps in the M direction without LDS reuse.
Tile constraints:
- BM == 16 or BM % 32 == 0. BM=16 uses one sub-MMA per CTA along M; the scale i32's mn_pack=1 byte is rotated into OPSEL byte 0/2 with shrui (see BlockScaledMmaOp_PreB.mma).
- WN == 16 or WN % 32 == 0. Same logic per-warp along N.
- num_k_mmas must be even (k_pack=2 cell halves).
- N must be a multiple of 32 (= 16 * mn_pack) for B-scale cell alignment.
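The constraints above, together with the BM == WM requirement, can be collected into a small pre-flight check. The following is a minimal sketch in plain Python (not Mojo); the function name `check_preb_tiles` and its parameter names are hypothetical and not part of this module's API.

```python
def check_preb_tiles(bm, wm, wn, n, num_k_mmas):
    """Return the list of violated constraints; an empty list means none were hit."""
    problems = []
    if bm != wm:
        # Equivalent to num_warps_m == 1.
        problems.append("BM must equal WM (num_warps_m == 1)")
    if not (bm == 16 or bm % 32 == 0):
        problems.append("BM must be 16 or a multiple of 32")
    if not (wn == 16 or wn % 32 == 0):
        problems.append("WN must be 16 or a multiple of 32")
    if num_k_mmas % 2 != 0:
        # k_pack=2 cells are consumed in halves.
        problems.append("num_k_mmas must be even")
    if n % 32 != 0:
        # 32 = 16 * mn_pack, for B-scale cell alignment.
        problems.append("N must be a multiple of 32")
    return problems


ok = check_preb_tiles(bm=16, wm=16, wn=32, n=4096, num_k_mmas=4)
bad = check_preb_tiles(bm=48, wm=48, wn=16, n=4112, num_k_mmas=3)
```

Here `ok` is empty, while `bad` reports the BM, num_k_mmas, and N violations. Returning all violations at once, rather than failing on the first, makes it easier to fix a rejected tile shape in one pass.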
Structs
- BlockScaledMmaOp_PreB: Per-warp register state + MFMA dispatch for the preb (preshuffled-B, preshuffled-scale) kernel.
- MXFP4MatmulAMD_PreB: Preshuffled-B variant of MXFP4MatmulAMD.
Functions