Mojo struct

PreshuffledScaleLoader

struct PreshuffledScaleLoader[MN_padded: Int, K_SCALES: Int]

Per-lane packed-Int32 scale loader from preshuffled GMEM.

Each i32 cell holds 4 E8M0 bytes packed in (k_pack, mn_pack) order; the MFMA's opsel byte index selects the right byte per sub-MMA. OOB lanes (past MN_padded * K_SCALES) read as zero.
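The packing described above can be modelled on the host in plain Python. This is a minimal sketch, not the loader's implementation. It assumes byte 0 sits in the low 8 bits of the word (little-endian), which this page does not state, and it leaves the mapping from (k_pack, mn_pack) to a byte index open. The helper names `pack_scales`, `select_byte`, and `e8m0_to_float` are hypothetical. The E8M0 decoding (an 8-bit biased exponent with bias 127, where 0xFF encodes NaN) follows the OCP Microscaling format definition.

```python
def pack_scales(b0, b1, b2, b3):
    """Pack four E8M0 bytes into one 32-bit word.

    Assumption: byte 0 occupies the low 8 bits (little-endian order).
    """
    word = 0
    for i, b in enumerate((b0, b1, b2, b3)):
        if not 0 <= b <= 0xFF:
            raise ValueError("each scale must fit in one byte")
        word |= b << (8 * i)
    return word


def select_byte(word, sel):
    """Return byte `sel` (0..3) of the word, as a byte-select index would."""
    if not 0 <= sel <= 3:
        raise ValueError("byte index must be in 0..3")
    return (word >> (8 * sel)) & 0xFF


def e8m0_to_float(byte):
    """Decode an E8M0 scale: 2 ** (byte - 127), with 0xFF reserved for NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


# One cell holding four scales; each sub-MMA would pick one byte.
word = pack_scales(127, 128, 126, 130)
scales = [e8m0_to_float(select_byte(word, sel)) for sel in range(4)]
```

The model treats the word as unsigned, whereas the loader returns a signed Int32; the bit pattern is the same either way.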

Parameters

  • MN_padded (Int): MN dimension rounded up to a multiple of 32 (the scale-block stride).
  • K_SCALES (Int): K // 32, that is, one E8M0 byte per 32 FP4 elements.
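As a quick check of the two parameter definitions, the following Python sketch derives both values from a logical problem size. The function name `scale_dims` is hypothetical, and the sketch assumes K is a multiple of 32 so that the floor division loses nothing.

```python
SCALE_BLOCK = 32  # MN rounding unit, and FP4 elements covered by one scale byte


def scale_dims(mn, k):
    """Return (MN_padded, K_SCALES, scale_count) for a logical MN x K problem."""
    if k % SCALE_BLOCK != 0:
        raise ValueError("this sketch assumes K is a multiple of 32")
    mn_padded = -(-mn // SCALE_BLOCK) * SCALE_BLOCK  # round MN up to a multiple of 32
    k_scales = k // SCALE_BLOCK                      # one E8M0 byte per 32 FP4 elements
    return mn_padded, k_scales, mn_padded * k_scales


mn_padded, k_scales, scale_count = scale_dims(100, 4096)
```

Here `scale_count` is the MN_padded * K_SCALES bound past which the loader's out-of-bounds reads return zero.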

Fields

  • bc (AMDBufferResource)

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

Methods

__init__

__init__(scale_gmem_tile: TileTensor[DType.uint8, address_space=scale_gmem_tile.address_space, linear_idx_type=scale_gmem_tile.linear_idx_type, element_size=scale_gmem_tile.element_size]) -> Self

Builds the V# from a preshuffled per-expert scale byte buffer.

load_packed

load_packed(self, mn: Int, k_scale: Int) -> Int32

Loads the packed Int32 scale word containing logical (mn, k_scale).

Pass the (mn, k_scale) of the cell base, that is, the position with mn_pack=0 and k_pack=0, and all 4 bytes of the cell come back in the returned i32. The MFMA's opsel then selects the byte for each sub-MMA.

Per-lane usage:

    mn = warp_mn_off + lane % 16                 # mn_lane within block
    k_scale = scale_pack_idx * 8 + (lane // 16)  # k_lane within block

Returns:

Int32
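The per-lane usage formulas for load_packed can be checked on the host with a short Python model. This is a sketch only: it assumes a 64-lane wavefront, which this page does not state, and `lane_coords` is a hypothetical helper that just evaluates the two documented expressions.

```python
WAVE_SIZE = 64  # assumption: one wavefront has 64 lanes


def lane_coords(lane, warp_mn_off, scale_pack_idx):
    """Evaluate the documented per-lane (mn, k_scale) expressions."""
    mn = warp_mn_off + lane % 16                 # mn_lane within block
    k_scale = scale_pack_idx * 8 + (lane // 16)  # k_lane within block
    return mn, k_scale


# Coordinates requested by every lane for one (warp_mn_off, scale_pack_idx) pair.
coords = [lane_coords(lane, 32, 1) for lane in range(WAVE_SIZE)]
mn_values = sorted({mn for mn, _ in coords})
k_values = sorted({k for _, k in coords})
```

Under that assumption each lane requests a distinct cell base: the 64 lanes cover 16 consecutive mn values and 4 consecutive k_scale values starting at scale_pack_idx * 8.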