Mojo struct

B200MatmulSmem

struct B200MatmulSmem[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, *, config: MatmulConfig[a_type, b_type, c_type, transpose_b]]

Shared memory layout for B200 SM100 matrix multiplication kernel.

This struct manages the shared memory allocation for:

- Input tiles (A and B matrices) with multi-stage pipelining
- Output tile (C matrix) for accumulation
- Synchronization barriers for producer-consumer coordination
- CLC (Cluster Launch Control) barriers and response storage
- TMEM (Tensor Memory) address and deallocation barrier

The memory is organized to support asynchronous TMA loads and efficient bank-conflict-free access patterns for tensor core operations.

Type aliases are provided for the tile array types (ATileArray, BTileArray, CTileArray) to enable cleaner function signatures.

Fields

input_tiles (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputTiles):
output_tiles (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputTiles):
epilogue_load_tiles_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].EpilogueLoadTileArray.Storage):
pipelines (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].Pipelines):

Implemented traits

AnyType, ImplicitlyDeletable

`comptime` members

`ATileArray`

comptime ATileArray = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputTiles.ATileArray

`BK`

comptime BK = config.block_tile_shape[Int(2)]

`BM`

comptime BM = config.block_tile_shape[Int(0)]

`BN`

comptime BN = config.block_tile_shape[Int(1)]

`BTileArray`

comptime BTileArray = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputTiles.BTileArray

`CTileArray`

comptime CTileArray = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputTiles.CTileArray

`epilogue_load_tile_cols`

comptime epilogue_load_tile_cols = B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM if config.AB_swapped else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].MMA_N if config.epilogue_is_1d else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM if config.AB_swapped else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputN

`epilogue_load_tile_rows`

comptime epilogue_load_tile_rows = Int(1) if config.epilogue_is_1d else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].MMA_N if config.AB_swapped else B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM
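
The nested conditionals in `epilogue_load_tile_rows` and `epilogue_load_tile_cols` are easier to follow as a small decision function. The sketch below is a Python model of those two expressions, not Mojo and not the library's implementation. The `Config` dataclass, the three-element `mma_shape`, and the sample values are hypothetical stand-ins for the `MatmulConfig` fields referenced above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for the MatmulConfig fields used here."""
    block_tile_shape: tuple   # (BM, BN, BK)
    mma_shape: tuple          # assumed (MMA_M, MMA_N, MMA_K); only index 1 is used
    output_tile_shape: tuple  # (OutputM, OutputN)
    AB_swapped: bool = False
    epilogue_is_1d: bool = False


def epilogue_load_tile_shape(cfg: Config) -> tuple:
    """Return (rows, cols) following the documented conditional chains."""
    bm = cfg.block_tile_shape[0]
    mma_n = cfg.mma_shape[1]
    output_n = cfg.output_tile_shape[1]

    # rows: 1 for a 1D epilogue, else MMA_N when A/B are swapped, else BM.
    rows = 1 if cfg.epilogue_is_1d else mma_n if cfg.AB_swapped else bm

    # cols: the documented expression tests AB_swapped a second time in its
    # last branch; that inner test can never be true, so it is dropped here.
    cols = bm if cfg.AB_swapped else mma_n if cfg.epilogue_is_1d else output_n
    return rows, cols


base = dict(block_tile_shape=(128, 64, 64),
            mma_shape=(128, 64, 16),
            output_tile_shape=(128, 32))

for swapped in (False, True):
    for is_1d in (False, True):
        cfg = Config(AB_swapped=swapped, epilogue_is_1d=is_1d, **base)
        print(swapped, is_1d, epilogue_load_tile_shape(cfg))
```

Under these assumptions, a 1D epilogue always yields a single row, and swapping A and B always yields `BM` columns.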

`EpilogueLoadTileArray`

comptime EpilogueLoadTileArray = SMemTileArray2DRowMajor[c_type, Int(1) if config.epilogue_is_1d else config.mma_shape[Int(1)] if config.AB_swapped else config.block_tile_shape[Int(0)], config.block_tile_shape[Int(0)] if config.AB_swapped else config.mma_shape[Int(1)] if config.epilogue_is_1d else config.block_tile_shape[Int(0)] if config.AB_swapped else config.output_tile_shape[Int(1)], config.num_accum_pipeline_stages if config.AB_swapped or config.epilogue_is_1d else config.num_tma_epilogue_pipeline_stages if config.use_tma_epilogue_load else Int(0)]

`InputTiles`

comptime InputTiles = StandardTileStorage[a_type, b_type, IndexList(config.block_tile_shape[Int(0)], config.block_tile_shape[Int(2)], __list_literal__=NoneType(None)), IndexList(config.block_tile_shape[Int(1)], config.block_tile_shape[Int(2)], __list_literal__=NoneType(None)), config.num_pipeline_stages]

`Layouts`

comptime Layouts = SmemLayouts[a_type, b_type, config.block_tile_shape[Int(0)], config.block_tile_shape[Int(1)], config.block_tile_shape[Int(2)], config.output_tile_shape[Int(0)], config.output_tile_shape[Int(1)], config.a_swizzle, config.b_swizzle, transpose_b]

`MMA_N`

comptime MMA_N = config.mma_shape[Int(1)]

`num_accum_pipeline_stages`

comptime num_accum_pipeline_stages = config.num_accum_pipeline_stages

`num_clc_pipeline_stages`

comptime num_clc_pipeline_stages = config.num_clc_pipeline_stages

`num_epilogue_load_stages`

comptime num_epilogue_load_stages = config.num_accum_pipeline_stages if config.AB_swapped or config.epilogue_is_1d else config.num_tma_epilogue_pipeline_stages if config.use_tma_epilogue_load else Int(0)
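
The stage count above is a three-way choice. As a rough illustration, the following Python function spells out the same chain with explicit branches. It is a model of the documented expression rather than the Mojo source, and it assumes the conditional expression groups as `X if (A or B) else (Y if C else 0)`, which is how Python parses it. The parameter names mirror the `config` fields, and the numeric values in the usage lines are made up.

```python
def num_epilogue_load_stages(
    ab_swapped: bool,
    epilogue_is_1d: bool,
    use_tma_epilogue_load: bool,
    num_accum_pipeline_stages: int,
    num_tma_epilogue_pipeline_stages: int,
) -> int:
    """Model of the documented num_epilogue_load_stages expression."""
    # Swapped-AB and 1D epilogues reuse the accumulator pipeline depth.
    if ab_swapped or epilogue_is_1d:
        return num_accum_pipeline_stages
    # Otherwise stages are allocated only when TMA epilogue loads are enabled.
    if use_tma_epilogue_load:
        return num_tma_epilogue_pipeline_stages
    # No epilogue load tiles are needed.
    return 0


# Hypothetical stage counts, for illustration only.
print(num_epilogue_load_stages(False, False, False, 2, 3))
print(num_epilogue_load_stages(False, False, True, 2, 3))
print(num_epilogue_load_stages(True, False, False, 2, 3))
```

Note that the result can be nonzero even when `use_tma_epilogue_load` is false, as long as `AB_swapped` or `epilogue_is_1d` is set.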

`num_group_pipeline_stages`

comptime num_group_pipeline_stages = (config // config)

`num_output_stages`

comptime num_output_stages = config.num_output_stages

`num_pipeline_stages`

comptime num_pipeline_stages = config.num_pipeline_stages

`OutputM`

comptime OutputM = config.output_tile_shape[Int(0)]

`OutputN`

comptime OutputN = config.output_tile_shape[Int(1)]

`OutputTiles`

comptime OutputTiles = OutputTileStorage[c_type, config.output_tile_shape[Int(0)], config.output_tile_shape[Int(1)], config.num_output_stages]

`Pipelines`

comptime Pipelines = SmemPipelineBundle[(config // config), config.num_accum_pipeline_stages, config.num_clc_pipeline_stages, StandardTilePayload[a_type, b_type, IndexList(config.block_tile_shape[Int(0)], config.block_tile_shape[Int(2)], __list_literal__=NoneType(None)), IndexList(config.block_tile_shape[Int(1)], config.block_tile_shape[Int(2)], __list_literal__=NoneType(None)), config.num_pipeline_stages], config.num_accum_pipeline_stages if config.AB_swapped or config.epilogue_is_1d else config.num_tma_epilogue_pipeline_stages if config.use_tma_epilogue_load else Int(0)]

Methods

`a_tiles`

def a_tiles(ref[AddressSpace._value] self) -> Self.ATileArray

Returns:

Self.ATileArray

`b_tiles`

def b_tiles(ref[AddressSpace._value] self) -> Self.BTileArray

Returns:

Self.BTileArray

`c_tiles`

def c_tiles(ref[AddressSpace._value] self) -> Self.CTileArray

Returns:

Self.CTileArray

`epilogue_load_tiles`

def epilogue_load_tiles(ref[AddressSpace._value] self) -> Self.EpilogueLoadTileArray

Returns:

Self.EpilogueLoadTileArray

`ab_pipeline_size`

static def ab_pipeline_size() -> Int

Total size of A+B tiles for all pipeline stages (in elements).

Returns:

Int

`c_output_size`

static def c_output_size() -> Int

Size of C tiles for all output stages (in elements).

Returns:

Int

`epilogue_load_tile_size`

static def epilogue_load_tile_size() -> Int

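Size of epilogue load tiles for all stages (in elements). Zero when num_epilogue_load_stages is zero, that is, when config.use_tma_epilogue_load is False and neither config.AB_swapped nor config.epilogue_is_1d is set.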

Returns:

Int

`total_tile_size`

static def total_tile_size() -> Int

Total tile storage size (A+B+C+epilogue load) in elements.

Returns:

Int
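
To make the four size helpers concrete, here is a back-of-the-envelope Python model of how the element counts could add up. The formulas are an assumption inferred from the tile shapes in `InputTiles`, `OutputTiles`, and `EpilogueLoadTileArray` (rows times columns times stages); the actual implementation may differ, for example if it pads or aligns tiles. The function name `tile_storage_elements` and all sample values are hypothetical.

```python
def tile_storage_elements(
    bm: int, bn: int, bk: int, num_pipeline_stages: int,
    output_m: int, output_n: int, num_output_stages: int,
    epi_rows: int = 0, epi_cols: int = 0, num_epilogue_load_stages: int = 0,
) -> dict:
    """Estimate tile storage in elements (not bytes), under the stated assumptions."""
    a_tile = bm * bk  # one A tile is BM x BK
    b_tile = bn * bk  # one B tile is BN x BK
    ab = (a_tile + b_tile) * num_pipeline_stages
    c = output_m * output_n * num_output_stages
    epi = epi_rows * epi_cols * num_epilogue_load_stages
    return {
        "ab_pipeline_size": ab,
        "c_output_size": c,
        "epilogue_load_tile_size": epi,
        "total_tile_size": ab + c + epi,
    }


# Hypothetical configuration: 128x64x64 block tile, 4 input stages,
# 128x32 output tile with 2 output stages, no epilogue load tiles.
sizes = tile_storage_elements(
    bm=128, bn=64, bk=64, num_pipeline_stages=4,
    output_m=128, output_n=32, num_output_stages=2,
)
print(sizes)
```

Because the counts are in elements, converting them to bytes would mean scaling each term by the size of its own element type (`a_type`, `b_type`, or `c_type`), and the barrier and TMEM storage held in `pipelines` is not part of this total.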
