Mojo struct

TiledTensorCore

struct TiledTensorCore[out_type: DType, in_type: DType, shape: IndexList[3], group_size: Int, transpose_b: Bool = False]

TiledTensorCore provides a wrapper around TensorCore to support multiple MMAs along the K dimension.

Enables larger K-dimension operations by decomposing them into multiple smaller MMA operations. Currently used only on AMD GPUs, where it enables 16x16x32 operations using two 16x16x16 MMAs.
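A sketch of how such a wrapper might be instantiated for the 16x16x32 case described above. The parameter values follow this page, but the import paths and the surrounding kernel context are assumptions, not the confirmed API; verify them against your Mojo version.

```mojo
# Assumed imports: TiledTensorCore is presumed to live alongside
# TensorCore in the layout package.
from layout.tensor_core import TiledTensorCore
from utils.index import IndexList

# Two 16x16x16 MMAs along K give an effective 16x16x32 operation:
# shape is the per-MMA [M, N, K]; group_size multiplies the K extent.
comptime MMA_16x16x32 = TiledTensorCore[
    DType.float32,             # out_type: accumulate in float32
    DType.bfloat16,            # in_type: bfloat16 matrix inputs
    IndexList[3](16, 16, 16),  # shape of each individual MMA
    2,                         # group_size: 2 MMAs -> K = 2 * 16 = 32
]
```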

Parameters

  • out_type (DType): The data type for output/accumulation operations.
  • in_type (DType): The data type for input matrix elements.
  • shape (IndexList[3]): The shape of each individual MMA operation as [M, N, K].
  • group_size (Int): Number of MMA operations along the K dimension.
  • transpose_b (Bool): Whether to transpose the b matrix. Defaults to False.

Implemented traits

AnyType, ImplicitlyDestructible

comptime members

__del__is_trivial

comptime __del__is_trivial = True

mma_op

comptime mma_op = TensorCore()

The underlying TensorCore instance for MMA operations.

Methods

mma

static mma[swap_a_b: Bool = False](a_reg_tile: LayoutTensor[a_reg_tile.dtype, a_reg_tile.layout, a_reg_tile.origin, address_space=a_reg_tile.address_space, element_layout=a_reg_tile.element_layout, layout_int_type=a_reg_tile.layout_int_type, linear_idx_type=a_reg_tile.linear_idx_type, masked=a_reg_tile.masked, alignment=a_reg_tile.alignment], b_reg_tile: LayoutTensor[b_reg_tile.dtype, b_reg_tile.layout, b_reg_tile.origin, address_space=b_reg_tile.address_space, element_layout=b_reg_tile.element_layout, layout_int_type=b_reg_tile.layout_int_type, linear_idx_type=b_reg_tile.linear_idx_type, masked=b_reg_tile.masked, alignment=b_reg_tile.alignment], c_reg_tile: LayoutTensor[c_reg_tile.dtype, c_reg_tile.layout, c_reg_tile.origin, address_space=c_reg_tile.address_space, element_layout=c_reg_tile.element_layout, layout_int_type=c_reg_tile.layout_int_type, linear_idx_type=c_reg_tile.linear_idx_type, masked=c_reg_tile.masked, alignment=c_reg_tile.alignment])

Perform multiple matrix multiply-accumulate operations along the K dimension.

Executes group_size MMA operations, processing slices of the K dimension and accumulating results in c_reg_tile.

Parameters:

  • swap_a_b (Bool): Whether to swap the a and b operands. Defaults to False.

Args:

  • a_reg_tile (LayoutTensor): Input matrix a fragments [num_m_mmas, group_size * a_frag_size].
  • b_reg_tile (LayoutTensor): Input matrix b fragments [num_n_mmas, group_size * b_frag_size].
  • c_reg_tile (LayoutTensor): Accumulation matrix c fragments, modified in-place.
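A sketch of a call site for this method. The fragment shapes follow the Args section above; the tensor names, the instantiation, and how the fragments were loaded into registers are assumptions for illustration.

```mojo
# Inside a GPU kernel, after loading per-thread register fragments:
#   a_reg_tile: [num_m_mmas, group_size * a_frag_size]
#   b_reg_tile: [num_n_mmas, group_size * b_frag_size]
#   c_reg_tile: accumulator fragments, updated in place
comptime MMA = TiledTensorCore[
    DType.float32, DType.bfloat16, IndexList[3](16, 16, 16), 2
]

# Runs group_size (here 2) MMAs over successive K slices,
# accumulating all partial products into c_reg_tile.
MMA.mma(a_reg_tile, b_reg_tile, c_reg_tile)
```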
