Mojo function
load_b_tr
load_b_tr[mma_shape: IndexList[3]](tile: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> SIMD[dtype, 8]
Loads the b operand tile for AMD tensor core MFMA instructions using transposed memory access.
This function supports double-rate MFMA shapes (32x32x16, 16x16x32) with bfloat16 input.
The input tile (shape = (mma_shape[2], mma_shape[1])) is split along the K dimension into
two halves of shape (MMA_K//2, MMA_N). Each half is loaded using _load_tr16_b64_warp, which
performs a transposed (column-major) load from shared memory. The resulting two 4-element SIMD
vectors are concatenated into a single SIMD[tile.dtype, 8] vector.
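As a concrete illustration of that split (not taken from the implementation), the sketch below shows the 16x16x32 case, where the tile has shape (32, 16), and how two 4-element half loads join into the 8-element result. The placeholder values are assumptions; the real halves come from the two `_load_tr16_b64_warp` calls described above.

```mojo
fn concat_halves(
    lo: SIMD[DType.bfloat16, 4], hi: SIMD[DType.bfloat16, 4]
) -> SIMD[DType.bfloat16, 8]:
    # Each 4-element half stands in for a transposed load of one
    # (MMA_K // 2, MMA_N) slice of the shared-memory tile.
    return lo.join(hi)

fn main():
    # Placeholder fragments; the real ones are produced by the two
    # transposed shared-memory loads described above.
    var lo = SIMD[DType.bfloat16, 4](0)
    var hi = SIMD[DType.bfloat16, 4](1)
    print(concat_halves(lo, hi))  # prints an 8-element bfloat16 vector
```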
Parameters:
- mma_shape (IndexList): The MMA instruction tile shape (only 32x32x16 or 16x16x32 is supported).
Args:
- tile (LayoutTensor): A LayoutTensor, residing in shared memory, with shape (mma_shape[2], mma_shape[1]) and dtype DType.bfloat16.
Returns:
SIMD[tile.dtype, 8]: Concatenated transposed SIMD loads from both halves of the tile.
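For orientation, a hedged call-site fragment (not from this reference): the kernel scaffolding, the shared-memory staging of `b_tile`, and the import path for `load_b_tr` are not specified here, so they are left as comments and labeled assumptions.

```mojo
from utils import IndexList

# Assumption: `load_b_tr` is in scope in the surrounding GPU kernel, and
# `b_tile` is a shared-memory LayoutTensor of DType.bfloat16 with shape
# (mma_shape[2], mma_shape[1]) = (32, 16).
alias mma_shape = IndexList[3](16, 16, 32)  # double-rate bf16 MFMA 16x16x32

# var b_frag = load_b_tr[mma_shape](b_tile)  # -> SIMD[DType.bfloat16, 8]
```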