Mojo struct

BlockwiseFP8Smem

struct BlockwiseFP8Smem[a_type: DType, b_type: DType, c_type: DType, a_scales_type: DType, transpose_b: Bool, *, config: MatmulConfig[a_type, b_type, c_type, transpose_b]]

Shared memory (SMEM) layout struct for blockwise FP8 matmul. It holds the A and B tiles, the A-scales, the C output tiles, and the synchronization barriers.

Key differences from BlockScaledSmem:

- A-scales are stored in SMEM (1D: 1 x BM per pipeline stage).
- No B-scales are stored in SMEM (they are read from global memory during the epilogue).
- Used with the register-based accumulation pattern.
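
To make the scaling scheme concrete, here is a minimal reference sketch in plain Python (not Mojo kernel code) of a blockwise-scaled matmul that scales each K-block's partial product before accumulating it. The function name, the argument layout, and the B-scale granularity (one scale per N-block per K-block) are assumptions for illustration. Only the A-scale shape, one scale per row of the A tile per stage, comes from the list above.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, bk, bn):
    """Reference model: C = sum over K-blocks of (A_blk @ B_blk) * scales.

    a:        M x K list of lists (FP8 payloads, modelled here as floats)
    b:        K x N list of lists
    a_scales: a_scales[kb][m], one scale per row of A per K-block
    b_scales: b_scales[kb][nb], one scale per N-block per K-block (assumed)
    """
    m_dim, k_dim, n_dim = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for kb in range(k_dim // bk):
        k_range = range(kb * bk, (kb + 1) * bk)
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial product for this K-block.
                partial = sum(a[m][k] * b[k][n] for k in k_range)
                # Apply both scales, then accumulate in higher precision.
                c[m][n] += partial * a_scales[kb][m] * b_scales[kb][n // bn]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
a_scales = [[1.0, 1.0], [2.0, 2.0]]  # indexed [K-block][row]
b_scales = [[1.0], [0.5]]            # indexed [K-block][N-block]
# In the second K-block the scales 2.0 and 0.5 cancel, so result equals a.
result = blockwise_scaled_matmul(a, identity, a_scales, b_scales, bk=1, bn=2)
```

Because the scale changes from one K-block to the next, the partial products cannot simply be summed unscaled and corrected once at the end, which is why a separate scaled accumulation step is needed.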

Fields

a_tiles_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].ATileArray.Storage):
b_tiles_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BTileArray.Storage):
c_tiles_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].CTileArray.Storage):
a_scales_tiles_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].AScalesTileArray.Storage):
tma_mma_mbars_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].InputBarriers.Storage):
accum_mbars_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].AccumBarriers.Storage):
clc_mbars_full_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].ClcBarriers.Storage):
clc_mbars_empty_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].ClcBarriers.Storage):
clc_throttle_mbars_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].ClcThrottleBarriers.Storage):
clc_response_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].ClcResponse.Storage):
tmem_dealloc_mbar_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].TmemDealloc.Storage):
tmem_addr_storage (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].TmemAddr.Storage):

Implemented traits

AnyType, ImplicitlyDestructible

`comptime` members

`__del__is_trivial`

comptime __del__is_trivial = True

`a_scales_smem_layout`

comptime a_scales_smem_layout = Layout.row_major(1, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BM)

`a_smem_layout`

comptime a_smem_layout = tile_layout_k_major[a_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BM, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BK, config.a_swizzle]()

`AccumBarriers`

comptime AccumBarriers = SMemArray[SharedMemBarrier, (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_accum_pipeline_stages * 2)]

`AScalesTileArray`

comptime AScalesTileArray = SMemTileArray[a_scales_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].a_scales_smem_layout, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_pipeline_stages, 128]

`ATileArray`

comptime ATileArray = SMemTileArray[a_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].a_smem_layout, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_pipeline_stages, 128]

`b_smem_layout`

comptime b_smem_layout = tile_layout_k_major[b_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BN, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BK, config.b_swizzle]() if transpose_b else tile_layout_mn_major[b_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BN, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].BK, config.b_swizzle]()

`BK`

comptime BK = config.block_tile_shape.__getitem__[3, DType.int64, Int](2)

`BM`

comptime BM = config.block_tile_shape.__getitem__[3, DType.int64, Int](0)

`BN`

comptime BN = config.block_tile_shape.__getitem__[3, DType.int64, Int](1)

`BTileArray`

comptime BTileArray = SMemTileArray[b_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].b_smem_layout, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_pipeline_stages, 128]

`c_smem_layout`

comptime c_smem_layout = Layout.row_major(BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].OutputM, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].OutputN)

`ClcBarriers`

comptime ClcBarriers = SMemArray[SharedMemBarrier, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_clc_pipeline_stages]

`ClcResponse`

comptime ClcResponse = SMemArray[UInt128, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_clc_pipeline_stages]

`ClcThrottleBarriers`

comptime ClcThrottleBarriers = SMemArray[SharedMemBarrier, (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_clc_pipeline_stages * 2)]

`CTileArray`

comptime CTileArray = SMemTileArray[c_type, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].c_smem_layout, BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_output_stages, 128]

`InputBarriers`

comptime InputBarriers = SMemArray[SharedMemBarrier, (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_group_pipeline_stages * 2)]

`MMA_M`

comptime MMA_M = config.mma_shape.__getitem__[3, DType.int64, Int](0)

`MMA_N`

comptime MMA_N = config.mma_shape.__getitem__[3, DType.int64, Int](1)

`num_accum_pipeline_stages`

comptime num_accum_pipeline_stages = config.num_accum_pipeline_stages

`num_clc_pipeline_stages`

comptime num_clc_pipeline_stages = config.num_clc_pipeline_stages

`num_group_pipeline_stages`

comptime num_group_pipeline_stages = (BlockwiseFP8Smem[a_type, b_type, c_type, a_scales_type, transpose_b, config=config].num_pipeline_stages // config)

`num_output_stages`

comptime num_output_stages = Int.__init__[Int](config.num_output_stages)

`num_pipeline_stages`

comptime num_pipeline_stages = config.num_pipeline_stages

`OutputM`

comptime OutputM = config.output_tile_shape.__getitem__[2, DType.int64, Int](0)

`OutputN`

comptime OutputN = config.output_tile_shape.__getitem__[2, DType.int64, Int](1)

`TmemAddr`

comptime TmemAddr = SMemArray[UInt32, 1]

`TmemDealloc`

Returns:

Int

`a_scales_pipeline_size`

static a_scales_pipeline_size() -> Int

Total size of A-scales tiles for all pipeline stages (in elements).

Returns:

Int
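
As a rough illustration, this element count follows from the `a_scales_smem_layout` shape (1 x BM) and the number of pipeline stages in `AScalesTileArray`. The sketch below is plain Python with hypothetical parameter names, not the Mojo implementation, and the example values are illustrative only.

```python
def a_scales_pipeline_elements(bm: int, num_pipeline_stages: int) -> int:
    """Elements in all A-scales tiles: one 1 x BM row per pipeline stage."""
    rows_per_stage = 1  # a_scales_smem_layout is row_major(1, BM)
    return rows_per_stage * bm * num_pipeline_stages


# Illustrative values (assumed): BM = 128 with 4 pipeline stages.
example = a_scales_pipeline_elements(bm=128, num_pipeline_stages=4)
```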

`c_output_size`

static c_output_size() -> Int

Size of C tiles for all output stages (in elements).

Returns:

Int

`total_tile_size`

static total_tile_size() -> Int

Total tile storage size (A+B+A-scales+C) in elements.

Returns:

Int
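
The total can be sketched as the sum of the four tile arrays, using the shapes given by `a_smem_layout` (BM x BK), `b_smem_layout` (BN x BK), `a_scales_smem_layout` (1 x BM), and `c_smem_layout` (OutputM x OutputN). This is a plain Python model under those assumptions. The `TileConfig` class and its field names are hypothetical and the example values are illustrative, not a real `MatmulConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical stand-in for the tile-shape fields of the config."""
    bm: int
    bn: int
    bk: int
    output_m: int
    output_n: int
    num_pipeline_stages: int
    num_output_stages: int

    def a_elements(self) -> int:
        return self.bm * self.bk * self.num_pipeline_stages

    def b_elements(self) -> int:
        return self.bn * self.bk * self.num_pipeline_stages

    def a_scales_elements(self) -> int:
        return 1 * self.bm * self.num_pipeline_stages

    def c_elements(self) -> int:
        return self.output_m * self.output_n * self.num_output_stages

    def total_tile_elements(self) -> int:
        # A + B + A-scales + C, counted in elements rather than bytes.
        return (self.a_elements() + self.b_elements()
                + self.a_scales_elements() + self.c_elements())


cfg = TileConfig(bm=128, bn=128, bk=128, output_m=128, output_n=32,
                 num_pipeline_stages=4, num_output_stages=2)
total = cfg.total_tile_elements()
```

Since the four arrays can use different element types (`a_type`, `b_type`, `a_scales_type`, `c_type`), an element count is not a byte count. Converting to bytes would require multiplying each term by its own element size.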


