Mojo struct

B200MatmulSmem

struct B200MatmulSmem[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, *, config: MatmulConfig[a_type, b_type, c_type, transpose_b]]

Shared memory layout for B200 SM100 matrix multiplication kernel.

This struct manages the shared memory allocation for:

Input tiles (A and B matrices) with multi-stage pipelining
Output tile (C matrix) for accumulation
Synchronization barriers for producer-consumer coordination
CLC (Cluster Launch Control) barriers and response storage
TMEM (Tensor Memory) address and deallocation barrier

The memory is organized to support asynchronous TMA loads and efficient bank-conflict-free access patterns for tensor core operations.

Type aliases are provided for tile types (ATile, BTile, CTile) to enable cleaner function signatures without verbose LayoutTensor declarations.

Fields

a_tiles_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ATileArray.StorageType):
b_tiles_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BTileArray.StorageType):
c_tiles_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].CTileArray.StorageType):
input_barriers_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].InputBarriers.StorageType):
accum_barriers_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].AccumBarriers.StorageType):
clc_full_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ClcBarriers.StorageType):
clc_empty_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ClcBarriers.StorageType):
clc_throttle_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ClcThrottleBarriers.StorageType):
clc_response_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].ClcResponse.StorageType):
tmem_dealloc_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].TmemDealloc.StorageType):
tmem_addr_storage (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].TmemAddr.StorageType):

Implemented traits

AnyType, ImplicitlyDestructible

`comptime` members

`delis_trivial`

comptime __del__is_trivial = True

`a_smem_layout`

comptime a_smem_layout = tile_layout_k_major[a_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, config.a_swizzle]()

`AccumBarriers`

comptime AccumBarriers = SMemArrayType[SharedMemBarrier, (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_accum_pipeline_stages * 2)]

`ATileArray`

comptime ATileArray = SMemTileArrayType[a_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].a_smem_layout, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_pipeline_stages, 128]

`b_smem_layout`

comptime b_smem_layout = tile_layout_k_major[b_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BN, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, config.b_swizzle]() if transpose_b else tile_layout_mn_major[b_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BN, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].BK, config.b_swizzle]()

`BK`

comptime BK = config.block_tile_shape.__getitem__[3, DType.int64, Int](2)

`BM`

comptime BM = config.block_tile_shape.__getitem__[3, DType.int64, Int](0)

`BN`

comptime BN = config.block_tile_shape.__getitem__[3, DType.int64, Int](1)

`BTileArray`

comptime BTileArray = SMemTileArrayType[b_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].b_smem_layout, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_pipeline_stages, 128]

`c_smem_layout`

comptime c_smem_layout = Layout.row_major(B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputM, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].OutputN)

`ClcBarriers`

comptime ClcBarriers = SMemArrayType[SharedMemBarrier, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_clc_pipeline_stages]

`ClcResponse`

comptime ClcResponse = SMemArrayType[UInt128, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_clc_pipeline_stages]

`ClcThrottleBarriers`

comptime ClcThrottleBarriers = SMemArrayType[SharedMemBarrier, (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_clc_pipeline_stages * 2)]

`CTileArray`

comptime CTileArray = SMemTileArrayType[c_type, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].c_smem_layout, B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_output_stages, 128]

`InputBarriers`

comptime InputBarriers = SMemArrayType[SharedMemBarrier, (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_group_pipeline_stages * 2)]

`num_accum_pipeline_stages`

comptime num_accum_pipeline_stages = config.num_accum_pipeline_stages

`num_clc_pipeline_stages`

comptime num_clc_pipeline_stages = config.num_clc_pipeline_stages

`num_group_pipeline_stages`

comptime num_group_pipeline_stages = (B200MatmulSmem[a_type, b_type, c_type, transpose_b, config=config].num_pipeline_stages // Int.__init__[Int](config.k_group_size))

`num_output_stages`

comptime num_output_stages = Int(config)

`num_pipeline_stages`

comptime num_pipeline_stages = Int.__init__[Int](config.num_pipeline_stages)

`OutputM`

comptime OutputM = config.output_tile_shape.__getitem__[2, DType.int64, Int](0)

`OutputN`

comptime OutputN = config.output_tile_shape.__getitem__[2, DType.int64, Int](1)

`TmemAddr`

comptime TmemAddr = SMemArrayType[UInt32, 1]

`TmemDealloc`

Returns:

Int

`c_output_size`

static c_output_size() -> Int

Size of C tiles for all output stages (in elements).

Returns:

Int

`total_tile_size`

static total_tile_size() -> Int

Total tile storage size (A+B+C) in elements.

Returns:

Int

View source

Was this page helpful?

Thank you! We'll create more content like this.

Thank you for helping us improve!

Fields​

Implemented traits​

comptime members​

__del__is_trivial​

a_smem_layout​

AccumBarriers​

ATileArray​

b_smem_layout​

BK​

BM​

BN​

BTileArray​

c_smem_layout​

ClcBarriers​

ClcResponse​

ClcThrottleBarriers​

CTileArray​

InputBarriers​

num_accum_pipeline_stages​

num_clc_pipeline_stages​

num_group_pipeline_stages​

num_output_stages​

num_pipeline_stages​

OutputM​

OutputN​

TmemAddr​

TmemDealloc​

Methods​

a_tiles​

b_tiles​

c_tiles​

input_barriers​

accum_barriers​

clc_full​

clc_empty​

clc_throttle​

clc_response​

tmem_dealloc​

tmem_addr​

ab_pipeline_size​

c_output_size​

total_tile_size​

Fields

Implemented traits

`comptime` members

`delis_trivial`

`a_smem_layout`

`AccumBarriers`

`ATileArray`

`b_smem_layout`

`BK`

`BM`

`BN`

`BTileArray`

`c_smem_layout`

`ClcBarriers`

`ClcResponse`

`ClcThrottleBarriers`

`CTileArray`

`InputBarriers`

`num_accum_pipeline_stages`

`num_clc_pipeline_stages`

`num_group_pipeline_stages`

`num_output_stages`

`num_pipeline_stages`

`OutputM`

`OutputN`

`TmemAddr`

`TmemDealloc`

Methods

`a_tiles`

`b_tiles`

`c_tiles`

`input_barriers`

`accum_barriers`

`clc_full`

`clc_empty`

`clc_throttle`

`clc_response`

`tmem_dealloc`

`tmem_addr`

`ab_pipeline_size`

`c_output_size`

`total_tile_size`