Mojo struct

B200MatmulSmem

struct B200MatmulSmem[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, *, config: MatmulConfig[a_type, b_type, c_type, transpose_b]]

Shared memory layout for B200 SM100 matrix multiplication kernel.

This struct manages the shared memory allocation for:

  • Input tiles (A and B matrices) with multi-stage pipelining
  • Output tile (C matrix) for accumulation
  • Synchronization barriers for producer-consumer coordination
  • CLC (Cluster Launch Control) barriers and response storage
  • TMEM (Tensor Memory) address and deallocation barrier

The memory is organized to support asynchronous TMA loads and efficient bank-conflict-free access patterns for tensor core operations.

Type aliases are provided for the tile array types (ATileArray, BTileArray, CTileArray) to enable cleaner function signatures.

Fields

  • input_tiles (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputTiles)
  • output_tiles (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputTiles)
  • epilogue_load_tiles_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].EpilogueLoadTileArray.Storage)
  • pipelines (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].Pipelines)

Implemented traits

AnyType, ImplicitlyDestructible

comptime members

ATileArray​

comptime ATileArray = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputTiles.ATileArray

BK​

comptime BK = config.block_tile_shape[2]

BM​

comptime BM = config.block_tile_shape[0]

BN​

comptime BN = config.block_tile_shape[1]

BTileArray​

comptime BTileArray = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputTiles.BTileArray

CTileArray​

comptime CTileArray = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputTiles.CTileArray

epilogue_load_tile_cols​

comptime epilogue_load_tile_cols = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM if config.AB_swapped else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].MMA_N if config.epilogue_is_1d else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM if config.AB_swapped else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputN

epilogue_load_tile_rows​

comptime epilogue_load_tile_rows = 1 if config.epilogue_is_1d else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].MMA_N if config.AB_swapped else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM
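
The two conditional chains above are easier to follow as a case table. The sketch below is a plain-Python model, not Mojo and not part of the library; the function name and argument names are hypothetical stand-ins for the compile-time values BM, MMA_N, OutputN, config.AB_swapped, and config.epilogue_is_1d. It assumes conditional expressions associate to the right, as they do in Python. Under that reading the column chain reduces to three cases: BM when AB_swapped is set, otherwise MMA_N for a 1D epilogue, otherwise OutputN.

```python
def epilogue_load_tile_shape(bm, mma_n, output_n, ab_swapped, epilogue_is_1d):
    """Return (rows, cols) for the epilogue load tile (illustrative model only)."""
    # Columns: BM whenever A and B are swapped; otherwise MMA_N for a
    # 1D epilogue and OutputN for a 2D epilogue.
    if ab_swapped:
        cols = bm
    elif epilogue_is_1d:
        cols = mma_n
    else:
        cols = output_n

    # Rows: a 1D epilogue always uses a single row; a 2D epilogue uses
    # MMA_N rows when swapped and BM rows otherwise.
    if epilogue_is_1d:
        rows = 1
    elif ab_swapped:
        rows = mma_n
    else:
        rows = bm

    return rows, cols
```

Grouping the column expression as (BM if AB_swapped else MMA_N) if epilogue_is_1d else (BM if AB_swapped else OutputN) gives the same result in all four cases, so the sketch holds under either reading.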

EpilogueLoadTileArray​

comptime EpilogueLoadTileArray = SMemTileArray2DRowMajor[c_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].epilogue_load_tile_rows, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].epilogue_load_tile_cols, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_epilogue_load_stages]

InputTiles​

comptime InputTiles = StandardTileStorage[a_type, b_type, IndexList(B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, __list_literal__=NoneType(None)), IndexList(B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BN, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, __list_literal__=NoneType(None)), B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_pipeline_stages]

Layouts​

comptime Layouts = SmemLayouts[a_type, b_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BN, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputN, config.a_swizzle, config.b_swizzle, transpose_b]

MMA_N​

comptime MMA_N = config.mma_shape[1]

num_accum_pipeline_stages​

comptime num_accum_pipeline_stages = config.num_accum_pipeline_stages

num_clc_pipeline_stages​

comptime num_clc_pipeline_stages = config.num_clc_pipeline_stages

num_epilogue_load_stages​

comptime num_epilogue_load_stages = config.num_accum_pipeline_stages if config.AB_swapped or config.epilogue_is_1d else config.num_tma_epilogue_pipeline_stages if config.use_tma_epilogue_load else 0
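
As a reading aid, the stage-count selection above can be modeled in plain Python. This is a sketch, not the library's code: the function and its arguments are hypothetical stand-ins for the corresponding config fields, and it assumes the expression parses as `X if (A or B) else (Y if C else 0)`.

```python
def num_epilogue_load_stages(
    ab_swapped,
    epilogue_is_1d,
    use_tma_epilogue_load,
    num_accum_pipeline_stages,
    num_tma_epilogue_pipeline_stages,
):
    """Model of the num_epilogue_load_stages selection (illustrative only)."""
    # Swapped or 1D epilogues reuse the accumulator pipeline depth.
    if ab_swapped or epilogue_is_1d:
        return num_accum_pipeline_stages
    # Otherwise stages exist only when the TMA epilogue load is enabled.
    if use_tma_epilogue_load:
        return num_tma_epilogue_pipeline_stages
    return 0
```

Under this reading, the count is zero only when all three flags are off.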

num_group_pipeline_stages​

comptime num_group_pipeline_stages = (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_pipeline_stages // config)

num_output_stages​

comptime num_output_stages = config.num_output_stages

num_pipeline_stages​

comptime num_pipeline_stages = config.num_pipeline_stages

OutputM​

comptime OutputM = config.output_tile_shape[0]

OutputN​

comptime OutputN = config.output_tile_shape[1]

OutputTiles​

comptime OutputTiles = OutputTileStorage[c_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputN, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_output_stages]

Pipelines​

comptime Pipelines = SmemPipelineBundle[B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_group_pipeline_stages, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_accum_pipeline_stages, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_clc_pipeline_stages, StandardTilePayload[a_type, b_type, IndexList(B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, __list_literal__=NoneType(None)), IndexList(B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BN, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, __list_literal__=NoneType(None)), B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_pipeline_stages], B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_epilogue_load_stages]

Methods

a_tiles​

a_tiles(ref[AddressSpace._value] self) -> B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ATileArray

Returns:

B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ATileArray

b_tiles​

b_tiles(ref[AddressSpace._value] self) -> B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BTileArray

Returns:

B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BTileArray

c_tiles​

c_tiles(ref[AddressSpace._value] self) -> B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].CTileArray

Returns:

B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].CTileArray

epilogue_load_tiles​

epilogue_load_tiles(ref[AddressSpace._value] self) -> B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].EpilogueLoadTileArray

Returns:

B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].EpilogueLoadTileArray

ab_pipeline_size​

static ab_pipeline_size() -> Int

Total size of A+B tiles for all pipeline stages (in elements).

Returns:

Int

c_output_size​

static c_output_size() -> Int

Size of C tiles for all output stages (in elements).

Returns:

Int

epilogue_load_tile_size​

static epilogue_load_tile_size() -> Int

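Size of epilogue load tiles for all stages (in elements). Zero when num_epilogue_load_stages is 0; per the definition of that member, this requires config.use_tma_epilogue_load=False with neither config.AB_swapped nor config.epilogue_is_1d set.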

Returns:

Int

total_tile_size​

static total_tile_size() -> Int

Total tile storage size (A+B+C+epilogue load) in elements.

Returns:

Int
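
The four size helpers report element counts, and the A, B, and C element types may differ, so a byte budget needs per-type widths. The following plain-Python sketch shows one way the arithmetic could work. It is an assumption, not the library's implementation: it takes each size to be tile elements times stage count, using the tile shapes from InputTiles, OutputTiles, and EpilogueLoadTileArray, and every class, field, and function name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmemTileModel:
    """Hypothetical stand-in for the compile-time values listed on this page."""
    bm: int
    bn: int
    bk: int
    output_m: int
    output_n: int
    num_pipeline_stages: int
    num_output_stages: int
    epilogue_rows: int = 1
    epilogue_cols: int = 1
    num_epilogue_load_stages: int = 0

    def ab_pipeline_size(self) -> int:
        # Assumed: one (BM x BK) A tile plus one (BN x BK) B tile per stage.
        return (self.bm * self.bk + self.bn * self.bk) * self.num_pipeline_stages

    def c_output_size(self) -> int:
        # Assumed: one (OutputM x OutputN) C tile per output stage.
        return self.output_m * self.output_n * self.num_output_stages

    def epilogue_load_tile_size(self) -> int:
        # Assumed: rows x cols per epilogue load stage; zero with zero stages.
        return self.epilogue_rows * self.epilogue_cols * self.num_epilogue_load_stages

    def total_tile_size(self) -> int:
        return (
            self.ab_pipeline_size()
            + self.c_output_size()
            + self.epilogue_load_tile_size()
        )


def tile_bytes(m: SmemTileModel, a_bytes: int, b_bytes: int, c_bytes: int) -> int:
    """Convert element counts to bytes using per-type element widths."""
    a_elems = m.bm * m.bk * m.num_pipeline_stages
    b_elems = m.bn * m.bk * m.num_pipeline_stages
    # C tiles and epilogue load tiles both hold c_type elements.
    c_elems = m.c_output_size() + m.epilogue_load_tile_size()
    return a_elems * a_bytes + b_elems * b_bytes + c_elems * c_bytes
```

For example, `SmemTileModel(128, 128, 64, 128, 32, 4, 2)` models 128x128x64 block tiles with four input stages and two 128x32 output stages; `total_tile_size()` then counts elements, while `tile_bytes(m, 1, 1, 2)` would give the footprint for 1-byte inputs and 2-byte outputs. Barrier and TMEM bookkeeping storage is not included in this model.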