
Mojo struct

PreshuffledScaleLoader

struct PreshuffledScaleLoader[MN_padded: Int, K_SCALES: Int]

Per-lane loader for packed Int32 scale words from preshuffled global memory (GMEM).

Each Int32 cell holds four E8M0 scale bytes packed in (k_pack, mn_pack) order; the MFMA's opsel byte index selects the byte for each sub-MMA. Out-of-bounds lanes (those addressing past MN_padded * K_SCALES) read as zero.
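The packing described above can be modeled on the host with a short Python sketch. It is not the Mojo API and not the hardware path. Two points are assumptions rather than facts from this page: byte 0 sits in the low 8 bits of the word, and the (k_pack, mn_pack) pair maps to a byte index as `k_pack * 2 + mn_pack` (a 2 x 2 cell with k_pack varying slowest). The helper names `pack_cell`, `select_byte`, `byte_index`, and `e8m0_to_float` are hypothetical. The E8M0 decoding (a bare 8-bit exponent with bias 127, with 0xFF reserved for NaN) follows the OCP Microscaling format.

```python
def pack_cell(scale_bytes):
    """Pack four E8M0 bytes into one 32-bit word (byte 0 in the low bits)."""
    assert len(scale_bytes) == 4
    word = 0
    for i, b in enumerate(scale_bytes):
        assert 0 <= b <= 0xFF
        word |= b << (8 * i)
    return word  # kept unsigned here; the Mojo loader returns a signed Int32


def select_byte(word, byte_idx):
    """Extract one byte from the packed word, as a byte-select index would."""
    assert 0 <= byte_idx < 4
    return (word >> (8 * byte_idx)) & 0xFF


def byte_index(k_pack, mn_pack):
    """ASSUMPTION: 2 x 2 cell, k_pack is the slower-varying index."""
    return k_pack * 2 + mn_pack


def e8m0_to_float(e):
    """Decode an E8M0 scale: 2**(e - 127); 0xFF encodes NaN."""
    if e == 0xFF:
        return float("nan")
    return 2.0 ** (e - 127)


cell = pack_cell([127, 128, 126, 130])
raw = select_byte(cell, byte_index(k_pack=1, mn_pack=0))
print(hex(cell), raw, e8m0_to_float(raw))
```

Under these assumptions, a single 32-bit load supplies the scales for four sub-MMAs, and only the byte index changes between them.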

Parameters

MN_padded (Int): MN dimension rounded up to a multiple of 32 (the scale-block stride).
K_SCALES (Int): K // 32, that is, one E8M0 byte per 32 FP4 elements.
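As a worked example of the two parameters, the following Python sketch derives them from a logical problem shape. The helper names `scale_dims` and `scale_count` are hypothetical, and the sketch assumes K is already a multiple of 32.

```python
SCALE_BLOCK = 32  # MN rounding unit and FP4 elements covered by one scale byte


def scale_dims(mn, k):
    """Return (MN_padded, K_SCALES) for a logical MN x K problem."""
    assert k % SCALE_BLOCK == 0, "sketch assumes K is a multiple of 32"
    mn_padded = (mn + SCALE_BLOCK - 1) // SCALE_BLOCK * SCALE_BLOCK
    k_scales = k // SCALE_BLOCK
    return mn_padded, k_scales


def scale_count(mn, k):
    """Number of E8M0 scales, i.e. the MN_padded * K_SCALES bound."""
    mn_padded, k_scales = scale_dims(mn, k)
    return mn_padded * k_scales


print(scale_dims(100, 4096))   # MN=100 pads to 128; K=4096 gives 128 scales per row
print(scale_count(100, 4096))
```

With one byte per scale, `scale_count` is also the size in bytes of the scale buffer for that shape.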

Fields

bc (AMDBufferResource): Buffer resource descriptor (V#) for the preshuffled scale buffer, built by `__init__`.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods

`__init__`

def __init__(scale_gmem_tile: TileTensor[DType.uint8, address_space=scale_gmem_tile.address_space, linear_idx_type=scale_gmem_tile.linear_idx_type, element_size=scale_gmem_tile.element_size]) -> Self

Builds the buffer resource descriptor (V#) from a preshuffled per-expert scale byte buffer.

`load_packed`

def load_packed(self, mn: Int, k_scale: Int) -> Int32

Loads the packed Int32 scale word containing logical (mn, k_scale).

Pass the (mn, k_scale) coordinates of the cell base, that is, the position with mn_pack=0 and k_pack=0. All four bytes of the cell come back in the returned Int32, and the MFMA's opsel then selects the byte for each sub-MMA.

Per-lane usage:

    mn = warp_mn_off + lane % 16             # mn_lane within block
    k_scale = k_pair_idx * 8 + (lane // 16)  # k_lane within block

Returns:

Int32
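To see which logical coordinates each lane requests, the per-lane formulas from `load_packed` can be evaluated on the host. This Python sketch only reproduces the index arithmetic; it does not model the preshuffled memory layout or the load itself. The 64-lane wavefront size and the helper name `lane_coords` are assumptions, and the values of `warp_mn_off` and `k_pair_idx` are arbitrary examples.

```python
WAVE_SIZE = 64  # ASSUMPTION: 64-lane wavefront


def lane_coords(lane, warp_mn_off, k_pair_idx):
    """Return the (mn, k_scale) pair a lane would pass to load_packed."""
    mn = warp_mn_off + lane % 16            # mn_lane within block
    k_scale = k_pair_idx * 8 + lane // 16   # k_lane within block
    return mn, k_scale


coords = [lane_coords(lane, warp_mn_off=32, k_pair_idx=1)
          for lane in range(WAVE_SIZE)]

# Print the coordinates of a few representative lanes.
for lane in (0, 15, 16, 63):
    print(lane, coords[lane])
```

Under the 64-lane assumption, each group of 16 consecutive lanes covers 16 consecutive mn values at one k_scale, and the four groups cover four consecutive k_scale values.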
