
Mojo struct

PreshuffledBLoader

struct PreshuffledBLoader[N: Int, K_BYTES: Int]

Per-lane B fragment loader from preshuffled GMEM (DRAM -> VGPR direct).

The preshuffled 5D layout places each lane's 16-byte fragment at a contiguous DRAM offset, so a single buffer_load_dwordx4 per lane delivers the MFMA B operand without staging through LDS. Out-of-bounds (OOB) lanes receive zeros, because the buffer resource's bounds check clamps loads past the end of the buffer to zero.

Parameters

  • N (Int): Per-expert N dimension (rows of the logical [N, K_BYTES] tile).
  • K_BYTES (Int): Per-expert K dimension in bytes after FP4 packing, with two FP4 values per byte (= K // 2).
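The relation between K and K_BYTES can be sketched in a few lines. This is a plain-Python model of the arithmetic only, not Mojo kernel code. The helper names `k_bytes_for` and `fragments_per_row` are hypothetical, and the sketch assumes that K is even and that K_BYTES is a multiple of the 16-byte fragment size.

```python
FRAGMENT_BYTES = 16  # one buffer_load_dwordx4 delivers 16 bytes per lane


def k_bytes_for(k: int) -> int:
    """Bytes needed for k FP4 values at two values per byte (K_BYTES = K // 2)."""
    if k % 2 != 0:
        raise ValueError("K must be even to pack two FP4 values per byte")
    return k // 2


def fragments_per_row(k_bytes: int) -> int:
    """Number of 16-byte fragments along one row of the [N, K_BYTES] tile.

    Assumption: K_BYTES is a multiple of 16 (not stated in the source).
    """
    if k_bytes % FRAGMENT_BYTES != 0:
        raise ValueError("K_BYTES is assumed to be a multiple of 16 here")
    return k_bytes // FRAGMENT_BYTES


if __name__ == "__main__":
    k = 256
    kb = k_bytes_for(k)
    print(kb, fragments_per_row(kb))
```

Each 16-byte fragment therefore carries 32 FP4 values under this packing.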

Fields

  • bc (AMDBufferResource): Buffer resource descriptor (V#) for the preshuffled B buffer, built by __init__.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

Methods

__init__

__init__(b_gmem_tile: TileTensor[DType.uint8, address_space=b_gmem_tile.address_space, linear_idx_type=b_gmem_tile.linear_idx_type, element_size=b_gmem_tile.element_size]) -> Self

Builds the buffer resource descriptor (V#) from a preshuffled per-expert B byte buffer.

load_fragment

load_fragment(self, n: Int, k_byte: Int) -> SIMD[DType.uint8, 16]

Loads the 16-byte B fragment at logical (n, k_byte).

For one MFMA dispatch, each lane calls this with n = warp_n_off + n_mma * 16 + lane % 16 and k_byte = k_tile * 64 + (lane // 16) * 16.
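To make the per-lane coordinates concrete, the following plain-Python sketch evaluates the two formulas above for every lane of one dispatch. It models the index arithmetic only and does not call the Mojo API. The function names `lane_coords` and `dispatch_coords` are hypothetical, and the 64-lane wavefront size is an assumption that is not stated in the text.

```python
WAVE_SIZE = 64  # assumption: one wavefront has 64 lanes


def lane_coords(lane: int, warp_n_off: int, n_mma: int, k_tile: int) -> tuple[int, int]:
    """Return the logical (n, k_byte) that a lane would pass to load_fragment."""
    n = warp_n_off + n_mma * 16 + lane % 16
    k_byte = k_tile * 64 + (lane // 16) * 16
    return n, k_byte


def dispatch_coords(warp_n_off: int, n_mma: int, k_tile: int) -> list[tuple[int, int]]:
    """Coordinates requested by all lanes for one MFMA dispatch."""
    return [lane_coords(lane, warp_n_off, n_mma, k_tile) for lane in range(WAVE_SIZE)]


if __name__ == "__main__":
    coords = dispatch_coords(warp_n_off=0, n_mma=0, k_tile=0)
    rows = sorted({n for n, _ in coords})
    k_offsets = sorted({k for _, k in coords})
    print(rows[0], rows[-1], k_offsets)
```

Under the 64-lane assumption, lane % 16 selects one of 16 rows and lane // 16 selects one of four 16-byte groups, so one dispatch covers a 16-row by 64-byte block with each fragment requested exactly once.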

Returns:

SIMD[DType.uint8, 16]