Mojo module

mxfp4_matmul_amd_preb

MXFP4 block-scaled matmul on AMD CDNA4 with preshuffled B + scales + direct VGPR loads.

Variant of MXFP4MatmulAMD that skips LDS staging for both B and the A/B scales. B is preshuffled host-side via Shuffler.preshuffle_b_5d so each lane's 16-byte fragment lives at a known DRAM offset and is read with a single buffer_load_dwordx4. Scales are addressed by Shuffler.scale_4d_byte_off — each lane reads one Int32 covering a (mn_pack=2, k_pack=2) cell that feeds 4 sub-MMAs via the MFMA's OPSEL byte selector.
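As a rough illustration of the scale addressing described above, the sketch below splits one 32-bit scale word into its four 8-bit scales and picks one per sub-MMA. It is plain Python, not the kernel's Mojo code. The function names are hypothetical, and the byte order (byte index = 2 * k + mn) is an assumption made for illustration; the real layout is defined by Shuffler.scale_4d_byte_off. The E8M0 decoding follows the OCP Microscaling definition of MXFP4 scales.

```python
import math

def unpack_scale_cell(word: int) -> list[int]:
    """Split a 32-bit scale word into four bytes (byte 0 = least significant)."""
    word &= 0xFFFFFFFF
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def select_scale(word: int, mn: int, k: int) -> int:
    """Pick the scale byte for sub-MMA (mn, k) in a (mn_pack=2, k_pack=2) cell.

    ASSUMPTION: byte index = 2 * k + mn. This mimics what a byte selector
    such as OPSEL does in hardware; the actual order may differ.
    """
    assert mn in (0, 1) and k in (0, 1)
    return unpack_scale_cell(word)[2 * k + mn]

def e8m0_to_float(byte: int) -> float:
    """Decode an E8M0 scale: 2**(byte - 127), with 0xFF reserved for NaN."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)

# One word feeds four sub-MMAs, one byte each.
word = 0x04030201
scales = [select_scale(word, mn, k) for k in (0, 1) for mn in (0, 1)]
```

Under the assumed layout, a single Int32 load per lane is enough to scale all four sub-MMAs of a cell, so no extra scale loads are needed between them.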

Only suitable when num_warps_m == 1 (BM == WM). Otherwise every warp along the M direction would re-read the same B data from DRAM, since there is no LDS staging to share it.

Tile constraints:

BM == 16 or BM % 32 == 0. BM=16 uses one sub-MMA per CTA along M; the scale i32's mn_pack=1 byte is rotated into OPSEL byte 0/2 with shrui (see BlockScaledMmaOp_PreB.mma).
WN == 16 or WN % 32 == 0. Same logic per-warp along N.
num_k_mmas must be even (k_pack=2 cell halves).
N must be a multiple of 32 (= 16 * mn_pack) for B-scale cell alignment.
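The constraints above, together with the num_warps_m == 1 requirement, can be checked on the host before launching. The following is a minimal Python sketch, not part of the module: the function name check_preb_tiles and its parameter list are hypothetical, and it assumes num_k_mmas is already known to the caller.

```python
def check_preb_tiles(BM: int, WM: int, WN: int, N: int,
                     num_k_mmas: int, mn_pack: int = 2) -> list[str]:
    """Return a list of violated constraints (empty if the config is valid)."""
    errors = []
    # num_warps_m == 1 means the block tile and warp tile match along M.
    if BM != WM:
        errors.append("num_warps_m must be 1 (BM == WM)")
    if not (BM == 16 or BM % 32 == 0):
        errors.append("BM must be 16 or a multiple of 32")
    if not (WN == 16 or WN % 32 == 0):
        errors.append("WN must be 16 or a multiple of 32")
    # k_pack=2 cells are consumed in halves, so the K MMA count must be even.
    if num_k_mmas % 2 != 0:
        errors.append("num_k_mmas must be even")
    # B-scale cells span 16 * mn_pack columns of N.
    if N % (16 * mn_pack) != 0:
        errors.append(f"N must be a multiple of {16 * mn_pack}")
    return errors

ok = check_preb_tiles(BM=32, WM=32, WN=32, N=4096, num_k_mmas=4)
bad = check_preb_tiles(BM=48, WM=24, WN=24, N=48, num_k_mmas=3)
```

Returning every violation at once, rather than failing on the first, makes it easier to see why a given tile shape is rejected.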

Structs

BlockScaledMmaOp_PreB: Per-warp register state + MFMA dispatch for the preb (preshuffled-B, preshuffled-scale) kernel.
MXFP4MatmulAMD_PreB: Preshuffled-B variant of MXFP4MatmulAMD.

Functions

a_lds_swizzle:
