Mojo function
load_b_tr
load_b_tr[mma_shape: IndexList[3]](tile: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> SIMD[dtype, 8]
Loads the b operand tile for AMD tensor core MFMA instructions using transposed memory access.
This function supports double-rate MFMA shapes (32x32x16, 16x16x32) with bfloat16 input.
The input tile (shape = (mma_shape[2], mma_shape[1])) is split along the K dimension into
two halves of shape (MMA_K//2, MMA_N). Each half is loaded using _load_tr16_b64_warp, which
performs a transposed (column-major) load from shared memory. The resulting two 4-element SIMD
vectors are concatenated into a single SIMD[tile.dtype, 8] vector.
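As a concrete illustration of that split (not taken from the implementation), the sketch below shows the 16x16x32 case, where the tile has shape (32, 16), and how two 4-element half loads join into the 8-element result. The placeholder values are assumptions; the real halves come from the two `_load_tr16_b64_warp` calls described above.

```mojo
fn concat_halves(
    lo: SIMD[DType.bfloat16, 4], hi: SIMD[DType.bfloat16, 4]
) -> SIMD[DType.bfloat16, 8]:
    # Each 4-element half stands in for a transposed load of one
    # (MMA_K // 2, MMA_N) slice of the shared-memory tile.
    return lo.join(hi)

fn main():
    # Placeholder fragments; the real ones are produced by the two
    # transposed shared-memory loads described above.
    var lo = SIMD[DType.bfloat16, 4](0)
    var hi = SIMD[DType.bfloat16, 4](1)
    print(concat_halves(lo, hi))  # prints an 8-element bfloat16 vector
```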
Parameters:
- mma_shape (IndexList): The MMA instruction tile shape (only 32x32x16 or 16x16x32 is supported).
Args:
- tile (LayoutTensor): A LayoutTensor, residing in shared memory, with shape (mma_shape[2], mma_shape[1]) and dtype DType.bfloat16.
Returns:
SIMD[tile.dtype, 8]: Concatenated transposed SIMD loads from both halves of the tile.
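For orientation, a hedged call-site fragment (not from this reference): the kernel scaffolding, the shared-memory staging of `b_tile`, and the import path for `load_b_tr` are not specified here, so they are left as comments and labeled assumptions.

```mojo
from utils import IndexList

# Assumption: `load_b_tr` is in scope in the surrounding GPU kernel, and
# `b_tile` is a shared-memory LayoutTensor of DType.bfloat16 with shape
# (mma_shape[2], mma_shape[1]) = (32, 16).
alias mma_shape = IndexList[3](16, 16, 32)  # double-rate bf16 MFMA 16x16x32

# var b_frag = load_b_tr[mma_shape](b_tile)  # -> SIMD[DType.bfloat16, 8]
```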