Mojo function

load_b_nt

load_b_nt[mma_shape: IndexList[3], swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]()](tile: LayoutTensor[dtype, layout, origin, address_space=AddressSpace.SHARED, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> SIMD[dtype, 8]

Loads the B operand tile for AMD tensor core MFMA from (N, K) storage.

This function supports double-rate MFMA shapes (32x32x16, 16x16x32) with bfloat16 input. Unlike load_b_tr, which expects (K, N) storage, this function works with (N, K) storage, which is common when transpose_b=True and B is stored row-major.

The input tile (shape = (mma_shape[1], mma_shape[2])) is split along the K dimension into two halves of shape (MMA_N, MMA_K//2). Each half is loaded using _load_tr16_b64_warp, which performs a transposed (column-major) load from shared memory. The hardware transpose effectively converts the (N, K) storage into the (K, N) layout needed by the MMA.
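
A brief sketch of the resulting shapes (assuming a 64-lane AMD wavefront; the transposed loads themselves are handled by the internal _load_tr16_b64_warp helper):

# mma_shape = (16, 16, 32): tile (16, 32) -> two halves of shape (16, 16)
# mma_shape = (32, 32, 16): tile (32, 16) -> two halves of shape (32, 8)
# Each half holds 256 bf16 elements, i.e. 4 per lane across 64 lanes, so the
# two halves together fill the returned SIMD[bfloat16, 8] fragment.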

Example: For 16x16x32 MMA with B stored as (N, K) = (16, 32) in LDS:

# B tile in LDS: shape (16, 32) = (MMA_N, MMA_K)
var b_tile = smem_b.tile[16, 32](n_idx, k_idx)
var b_reg = load_b_nt[IndexList[3](16, 16, 32)](b_tile)
# b_reg now contains 8 bf16 values ready for MFMA
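
As a further sketch, the returned fragment can then be fed to the MFMA together with a matching A-operand fragment. The snippet below assumes gpu.mma.mma accepts these fragment widths for the 16x16x32 bf16 instruction; a_reg and acc are illustrative placeholders, not part of this API:

from gpu.mma import mma

# Sketch only: a_reg is assumed to hold the matching 8-element bf16 A fragment,
# and acc is the per-lane float32 accumulator fragment for a 16x16x32 MFMA.
var acc = SIMD[DType.float32, 4](0)
mma(acc, a_reg, b_reg, acc)  # acc = a @ b + acc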

Parameters:

  • mma_shape (IndexList): The MMA instruction tile shape (only 32x32x16 or 16x16x32 supported).
  • swizzle (OptionalReg): Optional swizzle pattern for bank-conflict-free LDS access.

Args:

  • tile (LayoutTensor): A LayoutTensor, residing in shared memory, with shape (mma_shape[1], mma_shape[2]) and dtype DType.bfloat16. This is (N, K) storage order.

Returns:

SIMD[tile.dtype, 8]: Concatenated transposed SIMD loads from both halves of the tile.
