Mojo struct
PreshuffledScaleLoader
struct PreshuffledScaleLoader[MN_padded: Int, K_SCALES: Int]
Per-lane packed-Int32 scale loader from preshuffled GMEM.
Each i32 cell holds 4 E8M0 bytes packed in (k_pack, mn_pack) order;
the MFMA's opsel byte index selects the right byte per sub-MMA.
OOB lanes (past MN_padded * K_SCALES) read as zero.
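The packing described above can be modelled on the host. The following is a minimal Python sketch, not the loader's implementation: the helper names (`pack_cell`, `select_byte`, `e8m0_to_float`, `load_word`) are hypothetical, little-endian byte order within the word is an assumption, and the out-of-bounds rule is a simplified stand-in for the hardware's buffer bounds check. The E8M0 decode follows the usual definition (an 8-bit biased exponent, value 2^(e - 127), with 0xFF reserved for NaN).

```python
import struct

def pack_cell(scale_bytes):
    """Pack 4 E8M0 bytes into one signed 32-bit word (little-endian is an assumption)."""
    assert len(scale_bytes) == 4
    return struct.unpack("<i", bytes(scale_bytes))[0]

def select_byte(word, byte_index):
    """Extract byte `byte_index` (0..3) from a packed word, as a byte selector would."""
    return (word >> (8 * byte_index)) & 0xFF

def e8m0_to_float(e):
    """Decode an E8M0 scale: 2**(e - 127); 0xFF encodes NaN."""
    return float("nan") if e == 0xFF else 2.0 ** (e - 127)

def load_word(buf, byte_offset):
    """Simplified model of a bounds-checked load: out-of-range reads return zero."""
    if byte_offset < 0 or byte_offset + 4 > len(buf):
        return 0
    return struct.unpack_from("<i", buf, byte_offset)[0]

# One cell holding four scales: 2**0, 2**1, 2**-1, 2**3.
buf = bytes([127, 128, 126, 130])
word = pack_cell(buf)
scales = [e8m0_to_float(select_byte(word, i)) for i in range(4)]
oob_word = load_word(buf, 4)  # past the end of the buffer
```

In this model a single 32-bit load delivers all four scales, and the choice of which one to apply is deferred to the byte selector.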
Parameters
- MN_padded (Int): MN dimension rounded up to 32 (the scale-block stride).
- K_SCALES (Int): K // 32, i.e. one E8M0 byte per 32 FP4 elements.
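As a quick illustration of how the two parameters follow from a problem shape, here is a small Python sketch. The function name `scale_dims` and the example sizes are hypothetical, and K is assumed to be a multiple of 32.

```python
def scale_dims(mn, k):
    """Derive (MN_padded, K_SCALES) from logical MN and K sizes."""
    mn_padded = (mn + 31) // 32 * 32  # round MN up to the next multiple of 32
    k_scales = k // 32                # one E8M0 byte per 32 FP4 elements
    return mn_padded, k_scales

mn_padded, k_scales = scale_dims(100, 256)
in_bounds_bytes = mn_padded * k_scales  # scale bytes before the zero-fill region
```

For MN = 100 and K = 256 this gives MN_padded = 128 and K_SCALES = 8, so 1024 scale bytes are in bounds.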
Fields
- bc (AMDBufferResource):
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TrivialRegisterPassable
Methods
__init__
__init__(scale_gmem_tile: TileTensor[DType.uint8, address_space=scale_gmem_tile.address_space, linear_idx_type=scale_gmem_tile.linear_idx_type, element_size=scale_gmem_tile.element_size]) -> Self
Builds the buffer resource descriptor (V#) from a preshuffled per-expert scale byte buffer.
load_packed
load_packed(self, mn: Int, k_scale: Int) -> Int32
Loads the packed Int32 scale word containing logical (mn, k_scale).
Pass (mn, k_scale) at the cell base, i.e. (mn_pack=0, k_pack=0),
and all 4 bytes of the cell come back in the returned i32. The
MFMA's opsel then selects the byte for each sub-MMA.
Per-lane usage:
    mn = warp_mn_off + lane % 16                 # mn_lane within block
    k_scale = scale_pack_idx * 8 + (lane // 16)  # k_lane within block
Returns: Int32
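To see what coordinates the per-lane formulas for load_packed produce, the following Python sketch enumerates them on the host. It is an illustration only: the 64-lane wave size, the function name `lane_coords`, and the example values for `warp_mn_off` and `scale_pack_idx` are assumptions, and no GPU load is performed.

```python
WAVE_SIZE = 64  # assumed wavefront width

def lane_coords(lane, warp_mn_off, scale_pack_idx):
    """Return the (mn, k_scale) pair a lane would pass to load_packed."""
    mn = warp_mn_off + lane % 16                # mn_lane within block
    k_scale = scale_pack_idx * 8 + lane // 16   # k_lane within block
    return mn, k_scale

# Example values chosen for illustration.
coords = [lane_coords(lane, 32, 1) for lane in range(WAVE_SIZE)]
mn_values = sorted({mn for mn, _ in coords})
k_values = sorted({k for _, k in coords})
```

With these example inputs the 64 lanes cover 16 consecutive mn values (32 to 47) and 4 consecutive k_scale values (8 to 11), and each lane gets a distinct pair.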