Mojo module
blockwise_fp8_1d2d_matmul_kernel
Blockwise FP8 1D2D SM100 matmul kernel.
This kernel combines:
- Accumulation pattern from blockwise_fp8/ (register-based per-K scaling via BlockwiseFP8Accumulator, standard MMA, A-scales in SMEM, B-scales from GMEM)
- 1D2D work distribution from grouped_block_scaled_1d1d/ (GroupedWorkIterator1D1D, offset-based A tensor addressing, bounds-checked output, 3-warp specialization, SmemPipelineBundleNoClc)
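To make "offset-based A tensor addressing" and "bounds-checked output" more concrete, here is a minimal Python sketch of how a grouped work iterator might walk output tiles. It is not the `GroupedWorkIterator1D1D` implementation. The function name `iter_tiles`, the `a_offsets` layout (cumulative row offsets per group), and the tile sizes are all assumptions made for illustration.

```python
def iter_tiles(a_offsets, n, block_m, block_n):
    """Yield (group, row_start, row_end, col_start, col_end) work items.

    Hypothetical layout: a_offsets[g] is the first row of group g in a
    packed A tensor, and a_offsets[-1] is the total row count.
    """
    for g in range(len(a_offsets) - 1):
        g_start, g_end = a_offsets[g], a_offsets[g + 1]
        for m0 in range(g_start, g_end, block_m):
            for n0 in range(0, n, block_n):
                # Clamp the tile to the group and matrix bounds so a
                # partial tile never writes rows owned by the next group.
                yield g, m0, min(m0 + block_m, g_end), n0, min(n0 + block_n, n)


if __name__ == "__main__":
    # Two groups with 3 and 5 rows, 4 output columns, 4x4 tiles.
    for tile in iter_tiles([0, 3, 8], n=4, block_m=4, block_n=4):
        print(tile)
```

Clamping `row_end` to the group boundary stands in for the bounds check on output: group sizes need not be multiples of the tile height, so the last tile of each group is typically partial.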
Architecture:
- TMA warp: Loads A, B, and A-scale tiles using grid-constant TMAs
- MMA warp: Standard MMA (init_c=True on every K iteration, so each iteration writes a fresh partial result to TMEM)
- Epilogue warps: Per-K TMEM read → scale → register accumulate → final output with bounds checking
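The per-K flow above (fresh partial product, then scale, then accumulate) can be sketched on the CPU in plain Python. This only illustrates the arithmetic and is not the kernel's code. The function name and the scale layouts are assumptions: one A scale per row per K block, and one B scale per K block per N block.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Reference arithmetic for per-K-block scaled accumulation.

    Assumed (hypothetical) layouts:
      a[m][k], b[k][n]       : dequantized FP8 values as Python floats
      a_scales[m][k_block]   : one scale per row per K block
      b_scales[k_block][n_block] : one scale per K block per N block
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    acc = [[0.0] * n for _ in range(m)]  # plays the role of the register accumulator
    for kb, k0 in enumerate(range(0, k, block_k)):
        k1 = min(k0 + block_k, k)
        for i in range(m):
            for j in range(n):
                # Fresh partial result for this K block (mirrors init_c=True).
                partial = 0.0
                for kk in range(k0, k1):
                    partial += a[i][kk] * b[kk][j]
                # Scale the partial result, then accumulate it.
                scale = a_scales[i][kb] * b_scales[kb][j // block_n]
                acc[i][j] += partial * scale
    return acc


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0]]
    b = [[1.0], [1.0], [1.0], [1.0]]
    out = blockwise_scaled_matmul(a, b, [[0.5, 2.0]], [[1.0], [3.0]],
                                  block_k=2, block_n=1)
    print(out)  # [[43.5]]
```

The partial sum has to restart on every K block because each block carries its own scale factors. Accumulating unscaled products across blocks would mix values that need different scales.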
Structs
- BlockwiseFP8_1D2DMatmulKernel: Blockwise FP8 1D2D matmul kernel with register-based accumulation.