IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo struct

PreshuffledBLoader

struct PreshuffledBLoader[N: Int, K_BYTES: Int]

Per-lane B fragment loader from preshuffled GMEM (DRAM -> VGPR direct).

The 5D layout places each lane's 16-byte fragment at a contiguous DRAM offset, so a single buffer_load_dwordx4 per lane delivers the MFMA's B operand with no LDS staging. Out-of-bounds (OOB) lanes receive zeros, because the buffer resource's bounds check returns zero for loads past the end of the buffer.

Parameters

  • N (Int): Per-expert N dimension (rows of the logical [N, K_BYTES] tile).
  • K_BYTES (Int): Per-expert FP4-packed K dimension (= K // 2).
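
The relationship between K and K_BYTES can be sketched in a few lines of Python. This is only a host-side model of the arithmetic, not part of the Mojo API: the helper names `k_bytes_for` and `fp4_per_fragment` are hypothetical, and rejecting an odd K is an assumption made here for illustration.

```python
FP4_PER_BYTE = 2      # two 4-bit values are packed into each byte
FRAGMENT_BYTES = 16   # size of the per-lane fragment described above


def k_bytes_for(k: int) -> int:
    """Return K_BYTES for a logical K dimension of FP4 elements (K // 2)."""
    if k % FP4_PER_BYTE != 0:
        # Assumption: K is even so that it packs into whole bytes.
        raise ValueError("K must be even to pack FP4 pairs into bytes")
    return k // FP4_PER_BYTE


def fp4_per_fragment(fragment_bytes: int = FRAGMENT_BYTES) -> int:
    """Return how many FP4 elements one fragment of the given size holds."""
    return fragment_bytes * FP4_PER_BYTE
```

Under this model a logical K of 256 FP4 elements gives K_BYTES = 128, and one 16-byte fragment carries 32 FP4 elements.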

Fields

  • bc (AMDBufferResource):

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods

__init__

def __init__(b_gmem_tile: TileTensor[DType.uint8, address_space=b_gmem_tile.address_space, linear_idx_type=b_gmem_tile.linear_idx_type, element_size=b_gmem_tile.element_size]) -> Self

Builds the buffer resource descriptor (V#) from a preshuffled per-expert B byte buffer.

load_fragment

def load_fragment(self, n: Int, k_byte: Int) -> SIMD[DType.uint8, 16]

Loads the 16-byte B fragment at logical (n, k_byte).

For one MFMA dispatch, each lane calls this with n = warp_n_off + n_mma * 16 + lane % 16 and k_byte = k_tile * 64 + (lane // 16) * 16.
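
To make the per-lane mapping concrete, here is a small Python model of the index arithmetic. It is a sketch for reasoning about coverage, not the Mojo implementation: the function `lane_coords` is a hypothetical helper, and the 64-lane wavefront size is an assumption (it matches the 64-byte k_tile stride, but the page does not state it).

```python
WAVE_SIZE = 64        # assumption: 64 lanes per wavefront
FRAGMENT_BYTES = 16   # bytes returned by one load_fragment call


def lane_coords(lane: int, warp_n_off: int, n_mma: int, k_tile: int) -> tuple[int, int]:
    """Return the logical (n, k_byte) a lane passes to load_fragment."""
    n = warp_n_off + n_mma * 16 + lane % 16       # 16 rows per MFMA tile
    k_byte = k_tile * 64 + (lane // 16) * 16      # 4 fragments of 16 bytes per row
    return n, k_byte


# Coordinates for every lane of the first tile (all offsets zero).
coords = [lane_coords(lane, warp_n_off=0, n_mma=0, k_tile=0) for lane in range(WAVE_SIZE)]

# Every byte of the 16-row by 64-byte tile, keyed by (row, byte column).
covered = {
    (n, k_byte + b)
    for n, k_byte in coords
    for b in range(FRAGMENT_BYTES)
}
```

In this model, lanes 0 to 15 take rows 0 to 15 at byte offset 0, lanes 16 to 31 take the same rows at byte offset 16, and so on. The 64 fragments are distinct and together cover a 16-row by 64-byte tile exactly once.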

Returns:

SIMD[DType.uint8, 16]