Mojo struct
PreshuffledBLoader
struct PreshuffledBLoader[N: Int, K_BYTES: Int]
Per-lane B fragment loader from preshuffled GMEM (DRAM -> VGPR direct).
The 5D layout places each lane's 16-byte fragment at a contiguous DRAM
offset, so a single buffer_load_dwordx4 per lane delivers the MFMA's
B operand with no LDS staging. Out-of-bounds (OOB) lanes read back as zero,
enforced by the buffer resource's bounds check.
Parameters
- N (Int): Per-expert N dimension (rows of the logical [N, K_BYTES] tile).
- K_BYTES (Int): Per-expert FP4-packed K dimension (= K // 2).
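The relationship K_BYTES = K // 2 follows from FP4 packing, where two 4-bit values share one byte. The short sketch below only works through that arithmetic; the helper name and the value of K are assumptions for illustration, not part of this struct's API.

```mojo
# Hypothetical helper: converts a logical K (FP4 element count)
# into the packed byte count used as the K_BYTES parameter.
fn packed_k_bytes(k: Int) -> Int:
    # Two 4-bit FP4 values are stored per byte.
    return k // 2

def main():
    var k = 4096  # assumed logical K, for illustration only
    print(packed_k_bytes(k))
    # One 16-byte fragment therefore carries 16 * 2 FP4 values.
    print(16 * 2)
```

With K = 4096 this gives K_BYTES = 2048, and each 16-byte fragment holds 32 FP4 values.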
Fields
- bc (AMDBufferResource):
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TrivialRegisterPassable
Methods
__init__
__init__(b_gmem_tile: TileTensor[DType.uint8, address_space=b_gmem_tile.address_space, linear_idx_type=b_gmem_tile.linear_idx_type, element_size=b_gmem_tile.element_size]) -> Self
Builds the V# (the buffer resource descriptor) from a preshuffled per-expert B byte buffer.
load_fragment
load_fragment(self, n: Int, k_byte: Int) -> SIMD[DType.uint8, 16]
Loads the 16-byte B fragment at logical (n, k_byte).
For one MFMA dispatch a lane calls this with
(n = warp_n_off + n_mma * 16 + lane % 16, k_byte = k_tile * 64 + (lane // 16) * 16).
Returns:
SIMD[DType.uint8, 16]
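The per-lane call pattern given for load_fragment is plain index arithmetic, and can be sketched as below. The helper names, the example values for warp_n_off, n_mma and k_tile, and the 64-lane wavefront are assumptions for illustration; none of them are part of the PreshuffledBLoader API.

```mojo
# Illustrative helpers (hypothetical names, not part of the library).
# They only reproduce the (n, k_byte) arithmetic from the load_fragment docs.

fn fragment_n(warp_n_off: Int, n_mma: Int, lane: Int) -> Int:
    # Row in the logical [N, K_BYTES] tile: 16 rows per MFMA tile,
    # selected by the low four bits of the lane index.
    return warp_n_off + n_mma * 16 + lane % 16

fn fragment_k_byte(k_tile: Int, lane: Int) -> Int:
    # Byte offset along K: 64 bytes per k_tile, and each group of
    # 16 lanes takes the next 16-byte slice.
    return k_tile * 64 + (lane // 16) * 16

def main():
    # Assumed example values, chosen only to exercise the formulas.
    var warp_n_off = 32
    var n_mma = 1
    var k_tile = 2
    var lane = 37
    print(fragment_n(warp_n_off, n_mma, lane), fragment_k_byte(k_tile, lane))
```

For these values, lane 37 maps to n = 53 and k_byte = 160. Assuming 64 lanes, lane % 16 spans 16 rows and lane // 16 spans four 16-byte groups, so the lanes together cover a 16-row by 64-byte block per call.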