Mojo struct

Conv2dFpropKernel

struct Conv2dFpropKernel[act_type: DType, filter_type: DType, out_type: DType, config: Conv2dConfig[act_type, filter_type, out_type], cluster_shape: StaticTuple[Int32, 3] = StaticTuple(1), elementwise_compute_lambda_fn: Optional[elementwise_compute_lambda_type] = None, register_based_epilogue: Bool = True]

SM100 Conv2D forward propagation kernel.

This kernel implements conv2d forward propagation (fprop) as an implicit GEMM with warp specialization. It reuses the matmul kernel architecture and replaces the address calculation with a convolution-specific one.
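To make the term "implicit GEMM" concrete, the following plain-Python sketch compares a direct convolution with the same computation expressed as a matrix multiply whose row and reduction indices are decoded back into image coordinates. It is an illustration only, not the kernel's code: the tensor layouts (`act[n][h][w][c]`, `filt[k][r][s][c]`), the stride-1 and no-padding restriction, the ordering of the reduction index, and both function names are assumptions made for the example.

```python
# Plain-Python illustration only; this is not the kernel's implementation.
# Assumptions: activations are act[n][h][w][c], filters are filt[k][r][s][c],
# stride 1, no padding, no dilation.

def _shapes(act, filt):
    """Return (batch, h_out, w_out, channels, k_out, r_size, s_size)."""
    batch, height, width = len(act), len(act[0]), len(act[0][0])
    channels = len(act[0][0][0])
    k_out, r_size, s_size = len(filt), len(filt[0]), len(filt[0][0])
    return (batch, height - r_size + 1, width - s_size + 1,
            channels, k_out, r_size, s_size)


def conv2d_direct(act, filt):
    """Reference convolution written as nested loops over output positions."""
    batch, h_out, w_out, channels, k_out, r_size, s_size = _shapes(act, filt)
    out = [[[[0.0] * k_out for _ in range(w_out)] for _ in range(h_out)]
           for _ in range(batch)]
    for n in range(batch):
        for p in range(h_out):
            for q in range(w_out):
                for k in range(k_out):
                    acc = 0.0
                    for r in range(r_size):
                        for s in range(s_size):
                            for c in range(channels):
                                acc += act[n][p + r][q + s][c] * filt[k][r][s][c]
                    out[n][p][q][k] = acc
    return out


def conv2d_implicit_gemm(act, filt):
    """Same result as a GEMM of shape (N*H_out*W_out) x K_out x (R*S*C)."""
    batch, h_out, w_out, channels, k_out, r_size, s_size = _shapes(act, filt)
    gemm_m = batch * h_out * w_out
    gemm_n = k_out
    gemm_k = r_size * s_size * channels
    out = [[0.0] * gemm_n for _ in range(gemm_m)]
    for m in range(gemm_m):
        # Decode the GEMM row into an output pixel (the im2col address math).
        n, rem = divmod(m, h_out * w_out)
        p, q = divmod(rem, w_out)
        for j in range(gemm_n):
            acc = 0.0
            for kk in range(gemm_k):
                # Decode the reduction index into a filter tap and channel.
                r, rem_k = divmod(kk, s_size * channels)
                s, c = divmod(rem_k, channels)
                acc += act[n][p + r][q + s][c] * filt[j][r][s][c]
            out[m][j] = acc
    return out
```

The im2col matrix is never materialized here: each GEMM operand element is fetched from the original activation tensor by index arithmetic, which is the sense in which the GEMM is "implicit".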

The kernel assigns each role to dedicated warps:

Scheduler warp: CLC-based tile scheduling
Load warp: TMA loads with im2col transformation
MMA warp: Tensor core operations
Epilogue warps: Write output from TMEM to GMEM

Parameters

act_type (DType): Activation data type.
filter_type (DType): Filter data type.
out_type (DType): Output data type.
config (Conv2dConfig): Kernel configuration.
cluster_shape (StaticTuple): CUDA cluster dimensions.
elementwise_compute_lambda_fn (Optional): Optional epilogue lambda for fusion (bias add, activation functions, residual connections).
register_based_epilogue (Bool): Whether to apply the lambda in registers.
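As a rough picture of what an epilogue lambda used for fusion does, the sketch below applies a bias add followed by ReLU to every element of an output tile. It is written in plain Python rather than Mojo, and the callable shape `(coord, value) -> value` as well as the names `make_bias_relu` and `apply_epilogue` are hypothetical; the real `elementwise_compute_lambda_type` signature is not shown in this page.

```python
# Illustrative sketch only. The (coord, value) -> value shape is an assumption,
# not the actual elementwise_compute_lambda_type signature.

def make_bias_relu(bias):
    """Build a fused epilogue: add a per-output-channel bias, then apply ReLU."""
    def epilogue(coord, value):
        _, col = coord  # column index selects the output channel
        return max(value + bias[col], 0.0)
    return epilogue


def apply_epilogue(tile, fn):
    """Apply fn to every element of a 2D tile, passing its (row, col) coordinate."""
    return [[fn((i, j), v) for j, v in enumerate(row)]
            for i, row in enumerate(tile)]


fused = apply_epilogue([[-1.0, 2.0], [0.5, -3.0]], make_bias_relu([0.5, 1.0]))
print(fused)  # [[0.0, 3.0], [1.0, 0.0]]
```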

Implemented traits

AnyType, ImplicitlyDestructible

`comptime` members

`__del__is_trivial`

comptime __del__is_trivial = True

`accum_layout`

comptime accum_layout = Layout.row_major(Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_M, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_N)

`accum_pipeline_consumer_arv_count`

comptime accum_pipeline_consumer_arv_count = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group * Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS)
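The consumer arrival count above is the product of `cta_group` and `EPILOGUE_THREADS`. A small Python sketch of that arithmetic follows, assuming a warp size of 32 (the value of `WARP_SIZE` is not stated in this page) and using a hypothetical helper name.

```python
# Arithmetic sketch; WARP_SIZE = 32 is an assumption about the target GPU.
WARP_SIZE = 32
EPILOGUE_THREADS = 4 * WARP_SIZE


def accum_consumer_arrivals(cta_group):
    """Mirror of accum_pipeline_consumer_arv_count for a given cta_group."""
    return cta_group * EPILOGUE_THREADS


print(accum_consumer_arrivals(1))  # 128
print(accum_consumer_arrivals(2))  # 256
```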

`accum_pipeline_producer_arv_count`

comptime accum_pipeline_producer_arv_count = 1

`accum_type`

comptime accum_type = Conv2dConfig.accum_type()

`AccumTensor`

comptime AccumTensor = TmemTensor[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].accum_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].accum_layout, cta_group=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group]

`act_expected_bytes`

comptime act_expected_bytes = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.act_smem_layout.size() * size_of[act_type]())

`act_swizzle_elems`

comptime act_swizzle_elems = (config.a_swizzle.bytes() // size_of[act_type]())

`act_tile_dim0`

comptime act_tile_dim0 = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BM // Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_N)

`act_tma_load_size`

comptime act_tma_load_size = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_tile_dim0 * Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_swizzle_elems)
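The three activation tile quantities above (`act_swizzle_elems`, `act_tile_dim0`, `act_tma_load_size`) can be worked through with concrete numbers. The Python sketch below reproduces the formulas as written; the helper name and the example inputs (a 128-byte swizzle and 2-byte elements such as bfloat16) are illustrative assumptions, not values taken from any particular `Conv2dConfig`.

```python
# Arithmetic sketch of the activation TMA tile shape; inputs are example values.

def act_tma_shape(bm, cluster_n, swizzle_bytes, elem_bytes):
    """Return (act_tile_dim0, act_swizzle_elems, act_tma_load_size)."""
    act_swizzle_elems = swizzle_bytes // elem_bytes  # elements per swizzle row
    act_tile_dim0 = bm // cluster_n                  # rows loaded per TMA call
    return act_tile_dim0, act_swizzle_elems, act_tile_dim0 * act_swizzle_elems


print(act_tma_shape(128, 1, 128, 2))  # (128, 64, 8192)
print(act_tma_shape(128, 2, 128, 2))  # (64, 64, 4096)
```

With `CLUSTER_N = 2`, each load covers half of the `BM` rows, which is consistent with the activation tile being split across the CTAs along the N dimension of the cluster.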

`act_tma_rows`

comptime act_tma_rows = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_tile_dim0

`ActDescLayout`

comptime ActDescLayout = Layout[ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_tile_dim0], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_swizzle_elems], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_swizzle_elems], ComptimeInt[1]]

`ActTileLayout`

comptime ActTileLayout = Layout[ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_tile_dim0], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BK], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BK], ComptimeInt[1]]

`ActTileLoaderTypeIm2col`

comptime ActTileLoaderTypeIm2col = TileLoaderTMAIm2col[?, ?, ?, ?, cta_group=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group]

`ActTmaOp`

comptime ActTmaOp = TMATensorTileIm2col[act_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActDescLayout]()]

`BK`

comptime BK = config.block_tile_shape.__getitem__[Int](2)

`BM`

comptime BM = config.block_tile_shape.__getitem__[Int](0)

`BN`

comptime BN = config.block_tile_shape.__getitem__[Int](1)

`clc_consumer_arv_count`

comptime clc_consumer_arv_count = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SCHEDULER_THREADS + (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_SIZE * (((Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].TMA_LOAD_THREADS + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_THREADS) + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_LOAD_THREADS) + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS)))
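Stripped of the fully qualified names, the expression above reads `SCHEDULER_THREADS + CLUSTER_SIZE * (TMA_LOAD_THREADS + MMA_THREADS + EPILOGUE_LOAD_THREADS + EPILOGUE_THREADS)`. The Python sketch below evaluates it for a few cluster shapes, assuming a warp size of 32; the function name is hypothetical.

```python
# Arithmetic sketch of clc_consumer_arv_count; warp_size=32 is an assumption.

def clc_consumer_arv_count(cluster_m, cluster_n, warp_size=32):
    scheduler_threads = warp_size
    # Per-CTA consumers: load warp, MMA warp, epilogue-load warp, 4 epilogue warps.
    per_cta = warp_size + warp_size + warp_size + 4 * warp_size
    cluster_size = cluster_m * cluster_n
    return scheduler_threads + cluster_size * per_cta


print(clc_consumer_arv_count(1, 1))  # 256
print(clc_consumer_arv_count(2, 1))  # 480
print(clc_consumer_arv_count(2, 2))  # 928
```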

`clc_producer_arv_count`

comptime clc_producer_arv_count = 1

`clc_throttle_consumer_arv_count`

comptime clc_throttle_consumer_arv_count = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SCHEDULER_THREADS

`clc_throttle_producer_arv_count`

comptime clc_throttle_producer_arv_count = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].TMA_LOAD_THREADS

`CLUSTER_M`

comptime CLUSTER_M = config.cluster_shape.__getitem__[Int](0)

`CLUSTER_N`

comptime CLUSTER_N = config.cluster_shape.__getitem__[Int](1)

`CLUSTER_SIZE`

comptime CLUSTER_SIZE = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_M * Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_N)

`Context`

comptime Context = KernelContext[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].num_clc_pipeline_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_M, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_N]

`cta_group`

comptime cta_group = config.cta_group

`epi_load_consumer_arv_count`

comptime epi_load_consumer_arv_count = SIMD(Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS)

`epi_load_producer_arv_count`

comptime epi_load_producer_arv_count = 1

`EpiLoadPipelineType`

comptime EpiLoadPipelineType = EpiLoadPipeline[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].num_epi_load_stages]

`EPILOGUE_LOAD_THREADS`

comptime EPILOGUE_LOAD_THREADS = WARP_SIZE

`EPILOGUE_THREADS`

comptime EPILOGUE_THREADS = (4 * WARP_SIZE)

`EpilogueCtx`

comptime EpilogueCtx = EpilogueWarpContext[config.num_accum_pipeline_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].stage_stride_cols, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_THREADS, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS]

`filter_expected_bytes`

comptime filter_expected_bytes = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.filter_smem_layout.size() * size_of[filter_type]())

`filter_swizzle_elems`

comptime filter_swizzle_elems = (config.b_swizzle.bytes() // size_of[filter_type]())

`filter_tile_dim0`

comptime filter_tile_dim0 = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BN // (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_M // Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group))

`filter_tma_load_size`

comptime filter_tma_load_size = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_tile_dim0 * Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_swizzle_elems)
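The filter tile follows the same pattern as the activation tile, except that the row count is `BN // (CLUSTER_M // cta_group)`. The Python sketch below works through that formula; the helper name and the example inputs (128-byte swizzle, 2-byte elements) are illustrative assumptions.

```python
# Arithmetic sketch of the filter TMA tile shape; inputs are example values.

def filter_tma_shape(bn, cluster_m, cta_group, swizzle_bytes, elem_bytes):
    """Return (filter_tile_dim0, filter_swizzle_elems, filter_tma_load_size)."""
    filter_swizzle_elems = swizzle_bytes // elem_bytes
    filter_tile_dim0 = bn // (cluster_m // cta_group)
    return (filter_tile_dim0, filter_swizzle_elems,
            filter_tile_dim0 * filter_swizzle_elems)


print(filter_tma_shape(128, 2, 2, 128, 2))  # (128, 64, 8192)
print(filter_tma_shape(128, 2, 1, 128, 2))  # (64, 64, 4096)
```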

`filter_tma_rows`

comptime filter_tma_rows = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_tile_dim0

`FilterDescLayout`

comptime FilterDescLayout = Layout[ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_tile_dim0], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_swizzle_elems], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_swizzle_elems], ComptimeInt[1]]

`FilterTileLayout`

comptime FilterTileLayout = Layout[ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_tile_dim0], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BK], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BK], ComptimeInt[1]]

`FilterTileLoaderType`

comptime FilterTileLoaderType = TileLoaderTMA[?, ?, ?, ?, cta_group=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group]

`FilterTmaOp`

comptime FilterTmaOp = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterTmaTile.InnerType

`FilterTmaTile`

comptime FilterTmaTile = TMATile[filter_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterTileLayout, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterDescLayout]

`input_expected_bytes`

comptime input_expected_bytes = ((Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group * (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].act_expected_bytes + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].filter_expected_bytes)) * config)

`InputTilePipelineType`

comptime InputTilePipelineType = InputTilePipeline[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].TilePayload, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_group_pipeline_stages, config.k_group_size]

`MMA_K`

comptime MMA_K = config.mma_shape.__getitem__[Int](2)

`MMA_M`

comptime MMA_M = config.mma_shape.__getitem__[Int](0)

`MMA_N`

comptime MMA_N = config.mma_shape.__getitem__[Int](1)

`MMA_THREADS`

comptime MMA_THREADS = WARP_SIZE

`MmaCtx`

comptime MmaCtx = MmaWarpContext[config.num_accum_pipeline_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].stage_stride_cols, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_THREADS, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS]

`MmaEpilogueSync`

comptime MmaEpilogueSync = WarpGroupBarrier[(Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_THREADS + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS), 1]

`MmaOp`

comptime MmaOp = MmaOpSM100_SS[out_type, act_type, filter_type, config.block_tile_shape, config.mma_shape, accum_type=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].accum_type, cta_group=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group, cluster_shape=config.cluster_shape, a_swizzle=config.a_swizzle, b_swizzle=config.b_swizzle, transpose_b=True]

`num_accum_pipeline_stages`

comptime num_accum_pipeline_stages = config.num_accum_pipeline_stages

`num_clc_pipeline_stages`

comptime num_clc_pipeline_stages = config.num_clc_pipeline_stages

`num_epi_load_stages`

comptime num_epi_load_stages = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_epi_load_stages

`num_group_pipeline_stages`

comptime num_group_pipeline_stages = (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].num_pipeline_stages // config)

`num_output_stages`

comptime num_output_stages = config.num_output_stages

`num_output_warps`

comptime num_output_warps = 4

`num_pipeline_stages`

comptime num_pipeline_stages = config.num_pipeline_stages

`NUM_THREADS`

comptime NUM_THREADS = ((((Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SCHEDULER_THREADS + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].TMA_LOAD_THREADS) + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_THREADS) + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_LOAD_THREADS) + Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].EPILOGUE_THREADS)
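`NUM_THREADS` is the sum of the five per-role thread counts defined on this page. Assuming a warp size of 32 (the value of `WARP_SIZE` is not stated here), the total can be checked with a short Python sketch:

```python
# Arithmetic sketch of the thread budget; WARP_SIZE = 32 is an assumption.
WARP_SIZE = 32

SCHEDULER_THREADS = WARP_SIZE
TMA_LOAD_THREADS = WARP_SIZE
MMA_THREADS = WARP_SIZE
EPILOGUE_LOAD_THREADS = WARP_SIZE
EPILOGUE_THREADS = 4 * WARP_SIZE

NUM_THREADS = (SCHEDULER_THREADS + TMA_LOAD_THREADS + MMA_THREADS
               + EPILOGUE_LOAD_THREADS + EPILOGUE_THREADS)
print(NUM_THREADS)  # 256
```

Under that assumption the kernel uses eight warps per CTA: one each for scheduling, TMA loads, MMA, and epilogue loads, plus four epilogue warps.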

`NUM_TMEM_COLS`

comptime NUM_TMEM_COLS = 512

`out_swizzle_elems`

comptime out_swizzle_elems = (config.c_swizzle.bytes() // size_of[out_type]())

`out_tile_dim0`

comptime out_tile_dim0 = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutputM if (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_M == 256) if (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].MMA_M == 256)._mlir_value else (Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group == 1) else 64

`OutDescLayout`

comptime OutDescLayout = Layout[ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].out_tile_dim0], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].out_swizzle_elems], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].out_swizzle_elems], ComptimeInt[1]]

`OutputM`

comptime OutputM = config.output_tile_shape.__getitem__[Int](0)

`OutputN`

comptime OutputN = config.output_tile_shape.__getitem__[Int](1)

`OutputPipeline`

comptime OutputPipeline = OutputTilePipeline[config.num_accum_pipeline_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].stage_stride_cols, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group]

`OutTileLayout`

comptime OutTileLayout = Layout[ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].out_tile_dim0], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutputN], ComptimeInt[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutputN], ComptimeInt[1]]

`OutTmaOp`

comptime OutTmaOp = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutTmaTile.InnerType

`OutTmaTile`

comptime OutTmaTile = TMATile[out_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutTileLayout, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutDescLayout]

`Scheduler`

comptime Scheduler = TileScheduler[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].num_clc_pipeline_stages, Index[dtype=DType.uint32](config.cluster_shape.__getitem__[Int](0), config.cluster_shape.__getitem__[Int](1), config.cluster_shape.__getitem__[Int](2)), block_swizzle_size=config.block_swizzle_size]

`SCHEDULER_THREADS`

comptime SCHEDULER_THREADS = WARP_SIZE

`SmemType`

comptime SmemType = Conv2dSmem[act_type, filter_type, out_type, config=config]

`src_expected_bytes`

comptime src_expected_bytes = ((Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutputM * Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutputN) * size_of[out_type]())
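`src_expected_bytes` is the byte size of one output-shaped tile: `OutputM * OutputN * size_of[out_type]()`. A minimal Python sketch of the calculation, with a hypothetical helper name and example tile sizes that are not taken from any specific configuration:

```python
# Arithmetic sketch of src_expected_bytes; the inputs are example values.

def src_expected_bytes(output_m, output_n, elem_bytes):
    return output_m * output_n * elem_bytes


print(src_expected_bytes(128, 32, 2))  # 8192
```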

`SrcCTileArray`

comptime SrcCTileArray = SMemTileArray2DRowMajor[out_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.OutputM, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.OutputN, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_output_stages]

`SrcDescLayout`

comptime SrcDescLayout = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutDescLayout

`SrcTileLayout`

comptime SrcTileLayout = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutTileLayout

`SrcTileLoaderType`

comptime SrcTileLoaderType = TileLoaderTMA[?, ?, ?, ?, cta_group=1]

`SrcTmaOp`

comptime SrcTmaOp = Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SrcTmaTile.InnerType

`SrcTmaTile`

comptime SrcTmaTile = TMATile[out_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SrcTileLayout, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SrcDescLayout]

`stage_stride_cols`

comptime stage_stride_cols = (512 // Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].num_accum_pipeline_stages)
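`stage_stride_cols` divides the 512 tensor-memory columns (`NUM_TMEM_COLS`) evenly among the accumulator pipeline stages. The Python sketch below shows the stride and, as an assumption about how the stride is used, the starting column of each stage computed as `stage_index * stride`; the helper names are hypothetical.

```python
# Arithmetic sketch; the per-stage offset rule is an assumption for illustration.
NUM_TMEM_COLS = 512


def stage_stride_cols(num_accum_pipeline_stages):
    return NUM_TMEM_COLS // num_accum_pipeline_stages


def stage_column_offsets(num_accum_pipeline_stages):
    stride = stage_stride_cols(num_accum_pipeline_stages)
    return [stage * stride for stage in range(num_accum_pipeline_stages)]


print(stage_stride_cols(2))     # 256
print(stage_column_offsets(4))  # [0, 128, 256, 384]
```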

`TilePayload`

comptime TilePayload = StandardTilePayload[act_type, filter_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BM, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BK, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BN, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].BK, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_pipeline_stages]

`TileWriterType`

comptime TileWriterType = TileWriter[act_type, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].accum_type, config.block_tile_shape, config.mma_shape, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group, config.num_accum_pipeline_stages, config.c_swizzle, False, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.OutputM, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.OutputN, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_output_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].stage_stride_cols, 4, elementwise_compute_lambda_fn, register_based_epilogue]

`TMA_LOAD_THREADS`

comptime TMA_LOAD_THREADS = WARP_SIZE

`Tmem`

comptime Tmem = TmemAllocation[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group]

`TmemDealloc`

comptime TmemDealloc = TmemDeallocBarrier[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group]

Methods

`mma`

static mma[tiles_origin: MutOrigin, //](tmem_stage: TmemStage[config.num_accum_pipeline_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].stage_stride_cols, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group], tiles: InputConsumerStage[tiles_origin, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].TilePayload, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_group_pipeline_stages, config.k_group_size], mma_op: MmaOpSM100_SS[mma_op.c_type, mma_op.a_type, mma_op.b_type, mma_op.block_tile_shape, mma_op.mma_shape, accum_type=mma_op.accum_type, cta_group=mma_op.cta_group, cluster_shape=mma_op.cluster_shape, a_swizzle=mma_op.a_swizzle, b_swizzle=mma_op.b_swizzle, transpose_b=mma_op.transpose_b], elect_one_warp: Bool, iter_idx: UInt32, k_start: UInt32)

Execute MMA operations for one pipeline stage.

`init_barriers`

static init_barriers(ctx: KernelContext[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].num_clc_pipeline_stages, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_M, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].CLUSTER_N], act_tma_op: TMATensorTileIm2col[act_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActDescLayout]()], filter_tma_op: TMATensorTile[filter_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterDescLayout]()], out_tma_op: TMATensorTile[out_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutDescLayout]()], input_barriers: SMemArray[SharedMemBarrier, (Conv2dSmem[act_type, filter_type, out_type, config=config].num_group_pipeline_stages * 2)], accum_barriers: SMemArray[SharedMemBarrier, (Conv2dSmem[act_type, filter_type, out_type, 
config=config].num_accum_pipeline_stages * 2)], clc_throttle: SMemArray[SharedMemBarrier, (Conv2dSmem[act_type, filter_type, out_type, config=config].num_clc_pipeline_stages * 2)], clc_full: SMemArray[SharedMemBarrier, Conv2dSmem[act_type, filter_type, out_type, config=config].num_clc_pipeline_stages], clc_empty: SMemArray[SharedMemBarrier, Conv2dSmem[act_type, filter_type, out_type, config=config].num_clc_pipeline_stages], tmem_dealloc: SMemArray[SharedMemBarrier, 1], epi_load_barriers: SMemArray[SharedMemBarrier, (Conv2dSmem[act_type, filter_type, out_type, config=config].num_epi_load_stages * 2)], load_order_barrier: SMemArray[SharedMemBarrier, 1])

Initialize barriers and prefetch TMA descriptors.

`load_input_tiles`

static load_input_tiles[act_tma_origin: ImmutOrigin, filter_tma_origin: ImmutOrigin, tiles_origin: MutOrigin, //](act_loader: TileLoaderTMAIm2col[act_tma_origin, act_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActDescLayout](), cta_group=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group], filter_loader: TileLoaderTMA[filter_tma_origin, filter_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterDescLayout](), cta_group=Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].cta_group], tiles: InputProducerStage[tiles_origin, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].TilePayload, Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SmemType.num_group_pipeline_stages, config.k_group_size], iter_idx: UInt32, work_m_coord: Scalar[DType.uint], work_n_coord: Scalar[DType.uint], peer_cta_coord: Tuple[UInt, UInt, UInt], elect_one_cta: Bool)

Load activation (via im2col TMA) and filter tiles.

The im2col TMA descriptor handles coordinate transformation internally. Coordinates are in GEMM space:

work_m_coord: M coordinate (batch * H_out * W_out)
work_n_coord: N coordinate (output channels)
iter_idx: K dimension tile index (C * R * S)
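The GEMM-space extents listed above can be derived from the convolution shape. The Python sketch below does so using the standard output-size formula for strided, padded, dilated convolutions; the helper names, the parameter names, and the block size `bk = 64` in the usage lines are assumptions for illustration and are not part of this API.

```python
# Sketch of mapping a conv2d problem to GEMM extents (M, N, K).
# The output-size formula is the standard one; names here are hypothetical.

def conv_out_size(in_size, kernel, stride=1, pad=0, dilation=1):
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_gemm_mnk(batch, h, w, c, k_out, r, s, stride=1, pad=0, dilation=1):
    h_out = conv_out_size(h, r, stride, pad, dilation)
    w_out = conv_out_size(w, s, stride, pad, dilation)
    gemm_m = batch * h_out * w_out  # range of work_m_coord
    gemm_n = k_out                  # range of work_n_coord
    gemm_k = c * r * s              # reduction extent walked by iter_idx
    return gemm_m, gemm_n, gemm_k


m, n, k = conv2d_gemm_mnk(batch=8, h=56, w=56, c=64, k_out=128, r=3, s=3, pad=1)
print((m, n, k))  # (25088, 128, 576)

bk = 64                         # example K tile size
num_k_tiles = (k + bk - 1) // bk
print(num_k_tiles)              # 9
```

In this example `iter_idx` would take nine values (0 through 8), one per K tile.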

`run`

static run(act_tma_op: TMATensorTileIm2col[act_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActDescLayout]()], filter_tma_op: TMATensorTile[filter_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterDescLayout]()], out_tma_op: TMATensorTile[out_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutDescLayout]()], cluster_dim: StaticTuple[Int32, 3], mnk: StaticTuple[UInt32, 3])

Kernel entry point for Conv2D fprop (no residual).

Args:

act_tma_op (TMATensorTileIm2col): Im2col TMA descriptor for activation.
filter_tma_op (TMATensorTile): TMA descriptor for filter.
out_tma_op (TMATensorTile): TMA descriptor for output.
cluster_dim (StaticTuple): Cluster dimensions.
mnk (StaticTuple): GEMM dimensions (M, N, K).
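
The `mnk` argument can be derived from the convolution shape. The sketch below shows one way to compute it in plain Python, using the standard output-size formula. The function names and the stride, padding, and dilation parameters are hypothetical; this page does not state how the caller computes `mnk`.

```python
def conv2d_out_size(in_size, kernel, stride=1, pad=0, dilation=1):
    """Standard convolution output-size formula for one spatial dimension."""
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


def gemm_mnk(batch, h, w, c_in, c_out, r, s, stride=1, pad=0, dilation=1):
    """Return the implicit-GEMM problem size (M, N, K) for a conv2d fprop.

    M = batch * H_out * W_out, N = output channels, K = C * R * S.
    """
    h_out = conv2d_out_size(h, r, stride, pad, dilation)
    w_out = conv2d_out_size(w, s, stride, pad, dilation)
    return batch * h_out * w_out, c_out, c_in * r * s


# Example: batch 2, 8x8 input, 16 -> 32 channels, 3x3 filter, padding 1.
print(gemm_mnk(2, 8, 8, 16, 32, 3, 3, pad=1))
```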

`run_with_residual`

static run_with_residual(act_tma_op: TMATensorTileIm2col[act_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].ActDescLayout]()], filter_tma_op: TMATensorTile[filter_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].FilterDescLayout]()], out_tma_op: TMATensorTile[out_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].OutDescLayout]()], src_tma_op: TMATensorTile[out_type, _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SrcTileLayout](), _to_legacy_layout[Conv2dFpropKernel[act_type, filter_type, out_type, config, cluster_shape, elementwise_compute_lambda_fn, register_based_epilogue].SrcDescLayout]()], cluster_dim: StaticTuple[Int32, 3], mnk: StaticTuple[UInt32, 3], beta: Float32)

Kernel entry point for Conv2D fprop with residual (D = Conv + beta*C).

Args:

act_tma_op (TMATensorTileIm2col): Im2col TMA descriptor for activation.
filter_tma_op (TMATensorTile): TMA descriptor for filter.
out_tma_op (TMATensorTile): TMA descriptor for output D.
src_tma_op (TMATensorTile): TMA descriptor for source C (residual input).
cluster_dim (StaticTuple): Cluster dimensions.
mnk (StaticTuple): GEMM dimensions (M, N, K).
beta (Float32): Residual scale factor.
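
The residual formula D = Conv + beta*C can be modeled elementwise. The following is a minimal plain-Python sketch of that arithmetic on small nested lists; the function name is hypothetical, and it says nothing about how the kernel stages C through shared memory or in what precision it accumulates.

```python
def residual_epilogue(conv, src, beta):
    """Compute D = conv + beta * src elementwise on 2D nested lists."""
    return [
        [acc + beta * c for acc, c in zip(conv_row, src_row)]
        for conv_row, src_row in zip(conv, src)
    ]


# Example: one output row with beta = 0.5.
d = residual_epilogue([[1.0, 2.0]], [[10.0, 20.0]], 0.5)
print(d)
```

With `beta = 0.0` the result equals the plain convolution output, which matches the behavior of the no-residual entry point `run`.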
