Mojo struct

MXFP4MoERoutedMatmul

struct MXFP4MoERoutedMatmul[BM: Int = Int(64), BN: Int = Int(64), BK_ELEMS: Int = Int(256), num_warps_m: Int = Int(2), num_warps_n: Int = Int(2), topk: Int = Int(1), INPUT_ROW_MODE: InputRowMode = InputRowMode.TOKEN_ID, enable_swizzle: Bool = True]

Implements the routed MXFP4-by-MXFP4 MoE matmul for AMD CDNA4.

The kernel walks a 2D grid in which block_idx.x indexes N-tiles and block_idx.y indexes per-expert sort blocks. Each block decodes sorted_token_ids to gather A rows from the original token order, accumulates the block-scaled MFMA products against the preshuffled B and E8M0 scale buffers, and scatters the results to c[t*topk + s, :], where t is the token index and s is that token's top-k slot (0 <= s < topk).
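The indexing described above can be modeled on the host in a few lines of Python. This is a minimal sketch, not the kernel's code: `output_row` and `block_extent` are hypothetical helper names, `t` and `s` are read as the token index and top-k slot, and the block-to-tile mapping ignores the remapping that `enable_swizzle` applies.

```python
def output_row(t: int, s: int, topk: int) -> int:
    """Row of c that receives the result for token t in top-k slot s."""
    if not 0 <= s < topk:
        raise ValueError("slot must satisfy 0 <= s < topk")
    return t * topk + s


def block_extent(block_x: int, block_y: int, BM: int = 64, BN: int = 64):
    """Half-open (sorted-row range, column range) covered by one block.

    Assumes the unswizzled mapping: x selects the N-tile, y selects the
    sort block, and each sort block is BM rows tall.
    """
    rows = (block_y * BM, (block_y + 1) * BM)
    cols = (block_x * BN, (block_x + 1) * BN)
    return rows, cols


# With topk = 2, every token owns two consecutive rows of c.
rows = [output_row(t, s, topk=2) for t in range(3) for s in range(2)]
print(rows)
print(block_extent(1, 2))
```

Because each (token, slot) pair maps to a distinct row of c, two experts that process the same token never write to the same output row.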

Parameters

BM (Int): M-tile size in rows; also the height of each sort block (see sort_block_m).
BN (Int): N-tile size in columns.
BK_ELEMS (Int): K-tile size in MXFP4 elements (two per byte).
num_warps_m (Int): Number of warps assigned to the M dimension.
num_warps_n (Int): Number of warps assigned to the N dimension.
topk (Int): Number of experts each token routes to.
INPUT_ROW_MODE (InputRowMode): Selects how the kernel decodes A's row index from sorted_token_ids.
enable_swizzle (Bool): Enables the XCD/WGM block-id swizzle for MI355X L2 reuse.

Implemented traits

AnyType, ImplicitlyDeletable

`comptime` members

`BK_BYTES`

comptime BK_BYTES = (BK_ELEMS // Int(2))

`BK_SCALES`

comptime BK_SCALES = (BK_ELEMS // Int(32))
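The divisors 2 and 32 follow from the MXFP4 format: two 4-bit E2M1 elements share a byte, and every 32 elements share one E8M0 scale. The Python sketch below decodes one such block so the arithmetic is concrete. It covers the logical format only, not the preshuffled buffer layout this kernel consumes; the function names are hypothetical, and the low-nibble-first element order is an assumption.

```python
# E2M1 magnitudes indexed by the low three bits of a 4-bit code.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * FP4_MAGNITUDES[code & 0x7]


def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 scale: a power of two with bias 127; 0xFF encodes NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def dequantize_block(packed: bytes, scale_byte: int) -> list[float]:
    """Decode 16 packed bytes (32 FP4 elements) that share one E8M0 scale."""
    if len(packed) != 16:
        raise ValueError("one scale block is 32 elements, i.e. 16 bytes")
    scale = decode_e8m0(scale_byte)
    out = []
    for byte in packed:
        out.append(decode_fp4(byte & 0x0F) * scale)  # assumed: low nibble first
        out.append(decode_fp4(byte >> 4) * scale)
    return out


values = dequantize_block(bytes([0x21] * 16), 128)
print(len(values), values[:2])
```

With the default BK_ELEMS = 256, one K-tile therefore spans 128 bytes of packed data and 8 scale bytes per row.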

`C_FRAG_SIZE`

comptime C_FRAG_SIZE = (Int(256) // _resolve_warp_size())

`FRAG_W_BYTES`

comptime FRAG_W_BYTES = Int(16)

`MMA_K_BYTES`

comptime MMA_K_BYTES = Int(64)

`MMA_M`

comptime MMA_M = Int(16)

`MMA_N`

comptime MMA_N = Int(16)

`num_k_tiles_per_BK`

comptime num_k_tiles_per_BK = ((BK_ELEMS // Int(2)) // Int(64))

`num_m_mmas`

comptime num_m_mmas = ((BM // num_warps_m) // Int(16))

`num_n_mmas`

comptime num_n_mmas = ((BN // num_warps_n) // Int(16))

`num_scale_packs_per_BK`

comptime num_scale_packs_per_BK = (((BK_ELEMS // Int(2)) // Int(64)) // Int(2))

`num_threads`

comptime num_threads = ((num_warps_m * num_warps_n) * _resolve_warp_size())

`num_warps`

comptime num_warps = (num_warps_m * num_warps_n)

`pack_K`

comptime pack_K = Int(2)

`sort_block_m`

comptime sort_block_m = BM

`WM`

comptime WM = (BM // num_warps_m)

`WN`

comptime WN = (BN // num_warps_n)
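To see how the members above fit together, the following Python sketch evaluates the same formulas for the default parameters. It is a host-side model for checking a configuration by hand, not part of the Mojo API: `TileConfig` is a hypothetical name, and `warp_size = 64` is an assumed value for `_resolve_warp_size()` on CDNA hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    BM: int = 64
    BN: int = 64
    BK_ELEMS: int = 256
    num_warps_m: int = 2
    num_warps_n: int = 2
    warp_size: int = 64  # assumption: value of _resolve_warp_size() on CDNA

    def derived(self) -> dict:
        """Evaluate the documented comptime formulas for this configuration."""
        bk_bytes = self.BK_ELEMS // 2        # two FP4 elements per byte
        k_tiles = bk_bytes // 64             # MMA_K_BYTES = 64
        num_warps = self.num_warps_m * self.num_warps_n
        return {
            "BK_BYTES": bk_bytes,
            "BK_SCALES": self.BK_ELEMS // 32,
            "C_FRAG_SIZE": 256 // self.warp_size,
            "num_k_tiles_per_BK": k_tiles,
            "num_scale_packs_per_BK": k_tiles // 2,   # pack_K = 2
            "WM": self.BM // self.num_warps_m,
            "WN": self.BN // self.num_warps_n,
            "num_m_mmas": (self.BM // self.num_warps_m) // 16,  # MMA_M = 16
            "num_n_mmas": (self.BN // self.num_warps_n) // 16,  # MMA_N = 16
            "num_warps": num_warps,
            "num_threads": num_warps * self.warp_size,
        }


defaults = TileConfig().derived()
for name, value in defaults.items():
    print(f"{name} = {value}")
```

Under these assumptions each warp owns a 32x32 sub-tile made of 2x2 MMA tiles, and a block runs 256 threads. The constant 256 in C_FRAG_SIZE equals MMA_M * MMA_N, so it plausibly counts accumulator elements per lane for one 16x16 output tile; that reading is an inference, not something the page states.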

Methods

`run`

static def run[K_BYTES: Int, K_SCALES: Int, N: Int, N_padded_scale: Int](
    c: TileTensor[Storage=c.Storage, address_space=c.address_space, linear_idx_type=c.linear_idx_type],
    a_tt: TileTensor[DType.uint8, Storage=a_tt.Storage, address_space=a_tt.address_space, linear_idx_type=a_tt.linear_idx_type],
    b_pre_tt: TileTensor[DType.uint8, Storage=b_pre_tt.Storage, address_space=b_pre_tt.address_space, linear_idx_type=b_pre_tt.linear_idx_type],
    sfa_pre_tt: TileTensor[DType.uint8, Storage=sfa_pre_tt.Storage, address_space=sfa_pre_tt.address_space, linear_idx_type=sfa_pre_tt.linear_idx_type],
    sfb_pre_tt: TileTensor[DType.uint8, Storage=sfb_pre_tt.Storage, address_space=sfb_pre_tt.address_space, linear_idx_type=sfb_pre_tt.linear_idx_type],
    sorted_token_ids: TileTensor[DType.uint32, Storage=sorted_token_ids.Storage, address_space=sorted_token_ids.address_space, linear_idx_type=sorted_token_ids.linear_idx_type],
    expert_ids: TileTensor[DType.int32, Storage=expert_ids.Storage, address_space=expert_ids.address_space, linear_idx_type=expert_ids.linear_idx_type],
    num_tokens: Int,
    size_expert_ids: Int
)
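The page does not describe how the `run` parameters relate to the problem shape, so the Python sketch below records one plausible reading for planning purposes only. It assumes K_BYTES and K_SCALES mirror BK_BYTES and BK_SCALES (K // 2 and K // 32 for K logical MXFP4 elements), that size_expert_ids is the number of sort blocks, and that the launch grid is (N-tiles, sort blocks) as in the struct description. `launch_shape` and `ceildiv` are hypothetical helpers, not part of the Mojo API.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def launch_shape(K: int, N: int, size_expert_ids: int,
                 BN: int = 64, num_threads: int = 256) -> dict:
    """Assumed compile-time sizes and launch geometry for a K x N expert weight."""
    return {
        "K_BYTES": K // 2,     # assumption: two FP4 elements per byte along K
        "K_SCALES": K // 32,   # assumption: one E8M0 scale per 32 elements
        "grid": (ceildiv(N, BN), size_expert_ids),  # (N-tiles, sort blocks)
        "block": num_threads,
    }


shape = launch_shape(K=4096, N=4096, size_expert_ids=10)
print(shape)
```

Treat these values as a starting point and check them against the kernel source before relying on them.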
