Mojo struct

Shuffler

struct Shuffler[E: Int]

MXFP4 preshuffle layouts and helpers for AMD CDNA4.

Parameters

E (Int): Number of groups (experts / sort-blocks) the shuffler operates on. Use Shuffler[1] for single-group consumers.

Implemented traits

AnyType, ImplicitlyDeletable

`comptime` members

`b_5d_grouped_layout`

comptime b_5d_grouped_layout[N: Int, K_BYTES: Int] = Layout(Coord(ComptimeInt(), Coord(ComptimeInt(), ComptimeInt()), Coord(ComptimeInt(), ComptimeInt(), ComptimeInt())), Coord(ComptimeInt(), Coord(ComptimeInt(), ComptimeInt()), Coord(ComptimeInt(), ComptimeInt(), ComptimeInt())))

Parameters

N (Int):
K_BYTES (Int):

`B_STRIDE_K0`

comptime B_STRIDE_K0 = Int(1024)

`B_STRIDE_K_LANE`

comptime B_STRIDE_K_LANE = Int(256)

`B_STRIDE_LANE_BYTES`

comptime B_STRIDE_LANE_BYTES = Int(1)

`B_STRIDE_MN_LANE`

comptime B_STRIDE_MN_LANE = Shuffler[E].MFMA_LANE_BYTES

`BTileTensor`

comptime BTileTensor[N: Int, K_BYTES: Int] = TileTensor[DType.uint8, Layout[*?, *?], MutAnyOrigin]

Parameters

N (Int):
K_BYTES (Int):

`MFMA_K_BYTES`

comptime MFMA_K_BYTES = Int(64)

`MFMA_K_LANES`

comptime MFMA_K_LANES = Int(4)

`MFMA_LANE_BYTES`

comptime MFMA_LANE_BYTES = Int(16)

`MFMA_MN_LANES`

comptime MFMA_MN_LANES = Int(16)

`NUM_THREADS`

comptime NUM_THREADS = Int(64)

`packed_scale_bytes`

comptime packed_scale_bytes = Int(4)

`S_K_BLOCK`

comptime S_K_BLOCK = Int(8)

`S_K_PACK`

comptime S_K_PACK = Int(2)

`S_MN_BLOCK`

comptime S_MN_BLOCK = Int(32)

`S_MN_PACK`

comptime S_MN_PACK = Int(2)
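The B-layout constants listed above are related by simple arithmetic, which the following Python sketch checks. The values are copied from this page. The function `b_tile_byte_off` and its coordinate names (`k0`, `k_lane`, `mn_lane`, `lane_byte`) are hypothetical: they show one plausible pairing of coordinates with the `B_STRIDE_*` values, inferred from the constant names only, and are not taken from `b_5d_grouped_layout` itself.

```python
# Values copied from the `comptime` members documented on this page.
MFMA_MN_LANES = 16
MFMA_K_LANES = 4
MFMA_LANE_BYTES = 16
MFMA_K_BYTES = 64
NUM_THREADS = 64

B_STRIDE_LANE_BYTES = 1
B_STRIDE_MN_LANE = MFMA_LANE_BYTES
B_STRIDE_K_LANE = 256
B_STRIDE_K0 = 1024

# Relationships that hold between the listed values.
assert MFMA_MN_LANES * MFMA_K_LANES == NUM_THREADS
assert MFMA_K_LANES * MFMA_LANE_BYTES == MFMA_K_BYTES
assert B_STRIDE_K_LANE == MFMA_MN_LANES * B_STRIDE_MN_LANE
assert B_STRIDE_K0 == MFMA_K_LANES * B_STRIDE_K_LANE


def b_tile_byte_off(k0, k_lane, mn_lane, lane_byte):
    """Byte offset under an ASSUMED coordinate-to-stride pairing.

    This pairing is a guess based on the constant names; the real
    layout is whatever b_5d_grouped_layout defines.
    """
    if not 0 <= k_lane < MFMA_K_LANES:
        raise ValueError("k_lane out of range")
    if not 0 <= mn_lane < MFMA_MN_LANES:
        raise ValueError("mn_lane out of range")
    if not 0 <= lane_byte < MFMA_LANE_BYTES:
        raise ValueError("lane_byte out of range")
    return (
        k0 * B_STRIDE_K0
        + k_lane * B_STRIDE_K_LANE
        + mn_lane * B_STRIDE_MN_LANE
        + lane_byte * B_STRIDE_LANE_BYTES
    )
```

Under this assumed pairing, the three inner coordinates densely cover one 1024-byte block: the largest in-block offset is `b_tile_byte_off(0, 3, 15, 15)`, which is 1023.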

Methods

`scale_4d_byte_off`

static def scale_4d_byte_off[K_SCALES: Int, packed_mode: Bool = False](mn: Int, k_scale: Int) -> Int

Returns:

Int

`scale_4d_slot_byte_off`

static def scale_4d_slot_byte_off[K_SCALES: Int, packed_mode: Bool = False](expert_slot: Int, mn: Int, k_scale: Int, max_padded_M: Int) -> Int

Byte offset of an E8M0 scale within the per-expert scale_4d slot.

Single source of truth for the offset shared by (1) the standalone _preshuffle_grouped_scale_4d_kernel, (2) the fused_silu KS64 fold, and (3) the ep_wait KS224 fold. Each expert owns a fixed-stride slot of max_padded_M * K_SCALES bytes; within it the scale lands at scale_4d_byte_off(mn, k_scale).

Parameters:

K_SCALES (Int): Number of E8M0 scales along K (K // 32).
packed_mode (Bool): If True, return the byte index of the next packed scale (used by the standalone preshuffle's i32-cell gather); otherwise return the byte index of the scale itself.

Args:

expert_slot (Int): Per-expert slot index (expert_id + shared_offset).
mn (Int): Local row within the expert (token row, 0-based).
k_scale (Int): Scale index along K.
max_padded_M (Int): Per-expert slot stride in rows (= align_up(max, 32)).

Returns:

Int: Byte offset into the flat scale_4d buffer.
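The slot arithmetic described above can be modeled in a few lines of Python. This is a sketch, not the library's implementation: it assumes the final offset is the slot base plus the within-slot offset, and it takes that within-slot offset as an input (`within_slot_off`, a hypothetical parameter) because the formula behind `scale_4d_byte_off` is not documented on this page.

```python
def slot_base_byte_off(expert_slot, max_padded_m, k_scales):
    """Start of an expert's slot in the flat scale_4d buffer.

    Each slot has a fixed stride of max_padded_m * k_scales bytes.
    """
    if expert_slot < 0:
        raise ValueError("expert_slot must be non-negative")
    return expert_slot * max_padded_m * k_scales


def slot_byte_off(expert_slot, within_slot_off, max_padded_m, k_scales):
    """Slot base plus a within-slot offset (ASSUMED to be additive).

    within_slot_off stands in for scale_4d_byte_off(mn, k_scale),
    whose formula is not reproduced here.
    """
    slot_bytes = max_padded_m * k_scales
    if not 0 <= within_slot_off < slot_bytes:
        raise ValueError("within-slot offset falls outside the slot")
    return slot_base_byte_off(expert_slot, max_padded_m, k_scales) + within_slot_off


# Illustrative numbers only: 64 scales along K, slots of 128 padded rows.
base = slot_base_byte_off(expert_slot=3, max_padded_m=128, k_scales=64)
```

With these illustrative values each slot is 128 * 64 = 8192 bytes, so slot 3 starts at byte 24576.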

`scale_padded_mn`

static def scale_padded_mn(MN: Int) -> Int

Padded MN dimension used by the 4D scale layout: MN rounded up to the next multiple of 32 (S_MN_BLOCK).

Args:

MN (Int): Unpadded extent along the scale tensor's MN axis (number of rows) before rounding up to the S_MN_BLOCK atom tile.

Returns:

Int
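The rounding rule can be written as a small Python model. The function below mirrors the documented behavior (round MN up to a multiple of `S_MN_BLOCK`, which is 32) using the usual integer ceiling idiom; it is a sketch of the arithmetic, not the Mojo source.

```python
S_MN_BLOCK = 32  # value of the `S_MN_BLOCK` comptime member


def scale_padded_mn(mn):
    """Round mn up to the next multiple of S_MN_BLOCK."""
    if mn < 0:
        raise ValueError("mn must be non-negative")
    return (mn + S_MN_BLOCK - 1) // S_MN_BLOCK * S_MN_BLOCK


# A row count that is already a multiple of 32 is left unchanged.
padded = [scale_padded_mn(mn) for mn in (1, 32, 33, 100)]
```

For the inputs 1, 32, 33, and 100 this yields 32, 32, 64, and 128.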

`preshuffle_b_5d`

static def preshuffle_b_5d[N: Int, K_BYTES: Int](raw: TileTensor[DType.uint8, Storage=raw.Storage, linear_idx_type=raw.linear_idx_type], dst: TileTensor[DType.uint8, Storage=dst.Storage, linear_idx_type=dst.linear_idx_type], ctx: DeviceContext)

Launch the GPU MXFP4 B 5D preshuffle.

Invoked eagerly from model weight adapters (as a one-shot graph) so the shuffle runs once at session.load rather than through the NumPy CPU path, which takes on the order of hours. Mirrors the origin-handling pattern of block_scales_interleave: accept any origin, then cast to any-origin for the kernel.

Parameters:

N (Int): Per-expert N (must be a multiple of 16).
K_BYTES (Int): Per-expert FP4-packed K (must be a multiple of 64).

Args:

raw (TileTensor[DType.uint8, Storage=raw.Storage, linear_idx_type=raw.linear_idx_type]): Row-major source weights [E, N, K_BYTES].
dst (TileTensor[DType.uint8, Storage=dst.Storage, linear_idx_type=dst.linear_idx_type]): Destination buffer (same byte footprint; bytes get written in b_5d_grouped_layout order).
ctx (DeviceContext): AMD device context.
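Before launching the preshuffle it can help to check the shape constraints stated above on the host. The Python sketch below is a hypothetical helper (`check_b_shape` is not part of the library): it enforces the documented multiples for N and K_BYTES and derives the buffer footprint for a row-major `[E, N, K_BYTES]` tensor. The `k_scales` figure assumes two FP4 values per byte and one E8M0 scale per 32 elements along K, consistent with the `K // 32` definition of K_SCALES used elsewhere on this page.

```python
N_MULTIPLE = 16        # documented constraint on N
K_BYTES_MULTIPLE = 64  # documented constraint on K_BYTES
SCALE_GROUP = 32       # K elements per E8M0 scale (K_SCALES = K // 32)


def check_b_shape(num_experts, n, k_bytes):
    """Validate [E, N, K_BYTES] and return derived sizes (hypothetical helper)."""
    if num_experts < 1:
        raise ValueError("need at least one expert")
    if n % N_MULTIPLE != 0:
        raise ValueError(f"N={n} is not a multiple of {N_MULTIPLE}")
    if k_bytes % K_BYTES_MULTIPLE != 0:
        raise ValueError(f"K_BYTES={k_bytes} is not a multiple of {K_BYTES_MULTIPLE}")
    k_elems = 2 * k_bytes  # ASSUMPTION: two FP4 values packed per byte
    return {
        # raw and dst share this footprint; only the byte order differs.
        "footprint_bytes": num_experts * n * k_bytes,
        "k_scales": k_elems // SCALE_GROUP,
    }


# Illustrative shape only: 8 experts, N = 512, K_BYTES = 256.
info = check_b_shape(num_experts=8, n=512, k_bytes=256)
```

For this illustrative shape the footprint is 8 * 512 * 256 = 1048576 bytes and `k_scales` is 16.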

`preshuffle_scale_4d`

static def preshuffle_scale_4d[MN: Int, K_SCALES: Int, SrcLayout: TensorLayout](src: TileTensor[DType.uint8, SrcLayout, MutAnyOrigin], mut dst: HostBuffer[DType.uint8])

`preshuffle_grouped_scale_4d_gpu`

static def preshuffle_grouped_scale_4d_gpu[K_SCALES: Int, SfaRawLayout: TensorLayout, SfaPreLayout: TensorLayout, AOffsetsLayout: TensorLayout](sfa_raw: TileTensor[DType.uint8, SfaRawLayout, Storage=sfa_raw.Storage, linear_idx_type=sfa_raw.linear_idx_type], sfa_pre: TileTensor[DType.uint8, SfaPreLayout, Storage=sfa_pre.Storage, linear_idx_type=sfa_pre.linear_idx_type], a_offsets: TileTensor[DType.uint32, AOffsetsLayout, Storage=a_offsets.Storage, linear_idx_type=a_offsets.linear_idx_type], num_active_experts: Int, max_num_tokens_per_expert: Int, total_wg: Int, ctx: DeviceContext)
