Mojo struct

PreshuffledScaleLoader

struct PreshuffledScaleLoader[MN_padded: Int, K_SCALES: Int]

Per-lane packed-Int32 scale loader from preshuffled GMEM.

Each Int32 cell holds four E8M0 scale bytes packed in (k_pack, mn_pack) order; the MFMA's opsel byte index selects the appropriate byte for each sub-MMA. Out-of-bounds lanes (past MN_padded * K_SCALES) read as zero.
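The packing described above can be modeled on the host in plain Python. This is only a sketch of the arithmetic, not the Mojo API: the helper names are hypothetical, and it assumes byte 0 sits in the low bits of the word and that the byte index is k_pack * 2 + mn_pack (the source states only "(k_pack, mn_pack) order"). The E8M0 decoding follows the usual convention of an 8-bit biased exponent with bias 127 and 0xFF reserved for NaN.

```python
import math

def byte_index(k_pack: int, mn_pack: int) -> int:
    # ASSUMPTION: k_pack is the slow index and mn_pack the fast one.
    return k_pack * 2 + mn_pack

def pack_cell(b0: int, b1: int, b2: int, b3: int) -> int:
    # ASSUMPTION: little-endian packing, byte 0 in the lowest 8 bits.
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def select_byte(word: int, sel: int) -> int:
    # Models what an opsel-style byte selector does: pick byte `sel` (0..3).
    return (word >> (8 * sel)) & 0xFF

def e8m0_to_float(byte: int) -> float:
    # E8M0: exponent-only scale, value = 2 ** (byte - 127); 0xFF encodes NaN.
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)

def load_word_or_zero(buf: bytes, cell: int) -> int:
    # Models the out-of-bounds rule: a cell past the end of the buffer reads as zero.
    off = 4 * cell
    if off + 4 > len(buf):
        return 0
    return int.from_bytes(buf[off:off + 4], "little")

word = pack_cell(127, 128, 126, 125)
scale = e8m0_to_float(select_byte(word, byte_index(k_pack=0, mn_pack=1)))
```

With these assumptions, `scale` is the decoded value of the second byte in the cell. The model returns the word as an unsigned Python integer, whereas the documented loader returns a signed Int32; the bit pattern is the same.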

Parameters​

  • ​MN_padded (Int): MN dimension rounded up to 32 (the scale-block stride).
  • K_SCALES (Int): K // 32, i.e. one E8M0 byte per 32 FP4 elements.
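As a rough illustration of how the two parameters relate to a problem shape, the following Python sketch derives them from hypothetical logical sizes MN and K. The function name and the example sizes are assumptions for illustration; only the round-up-to-32 and K // 32 rules come from the parameter descriptions.

```python
SCALE_BLOCK = 32  # scale-block stride and FP4 elements per E8M0 byte, per the parameter docs

def scale_params(mn: int, k: int) -> tuple[int, int]:
    # MN_padded: MN rounded up to the next multiple of 32.
    mn_padded = (mn + SCALE_BLOCK - 1) // SCALE_BLOCK * SCALE_BLOCK
    # K_SCALES: one scale byte per 32 elements along K.
    k_scales = k // SCALE_BLOCK
    return mn_padded, k_scales

mn_padded, k_scales = scale_params(mn=100, k=4096)
total_scale_bytes = mn_padded * k_scales   # in-bounds extent used by the loader
total_cells = total_scale_bytes // 4       # four scale bytes per Int32 cell
```

For MN = 100 and K = 4096 this gives MN_padded = 128 and K_SCALES = 128, so the in-bounds region covers 16384 scale bytes, or 4096 packed cells.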

Fields​

  • ​bc (AMDBufferResource):

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods​

__init__​

def __init__(scale_gmem_tile: TileTensor[DType.uint8, address_space=scale_gmem_tile.address_space, linear_idx_type=scale_gmem_tile.linear_idx_type, element_size=scale_gmem_tile.element_size]) -> Self

Builds the V# (the AMD buffer resource descriptor) from a preshuffled per-expert scale byte buffer.

load_packed​

def load_packed(self, mn: Int, k_scale: Int) -> Int32

Loads the packed Int32 scale word containing logical (mn, k_scale).

Pass (mn, k_scale) at (mn_pack=0, k_pack=0), the cell base, and all 4 bytes of the cell come back in the returned i32. The MFMA's opsel then selects the byte for each sub-MMA.

Per-lane usage:

    mn = warp_mn_off + lane % 16              # mn_lane within block
    k_scale = k_pair_idx * 8 + (lane // 16)   # k_lane within block
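To see which coordinates each lane ends up requesting, the per-lane formulas can be evaluated on the host. This Python sketch assumes a 64-lane wavefront, which the page does not state; `lane_coords` is a hypothetical helper, and `warp_mn_off` and `k_pair_idx` are taken as plain integers supplied by the caller.

```python
WAVE_SIZE = 64  # ASSUMPTION: 64 lanes per wavefront

def lane_coords(lane: int, warp_mn_off: int, k_pair_idx: int) -> tuple[int, int]:
    mn = warp_mn_off + lane % 16             # mn_lane within block: 0..15
    k_scale = k_pair_idx * 8 + (lane // 16)  # k_lane within block: 0..3 for 64 lanes
    return mn, k_scale

# Coordinates requested by every lane for the first block of the first K pair.
coords = [lane_coords(lane, warp_mn_off=0, k_pair_idx=0) for lane in range(WAVE_SIZE)]

mn_values = sorted({mn for mn, _ in coords})
k_values = sorted({k for _, k in coords})
```

Under that assumption the 64 lanes cover 16 distinct mn values and 4 distinct k_scale values, with each lane requesting a unique (mn, k_scale) pair.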

Returns:

Int32