Mojo module

blockwise_fp8_1d2d_matmul_kernel

Blockwise FP8 1D2D SM100 matmul kernel.

This kernel combines:

  • Accumulation pattern from blockwise_fp8/ (register-based per-K scaling via BlockwiseFP8Accumulator, standard MMA, A-scales in SMEM, B-scales from GMEM)
  • 1D2D work distribution from grouped_block_scaled_1d1d/ (GroupedWorkIterator1D1D, offset-based A tensor addressing, bounds-checked output, 3-warp specialization, SmemPipelineBundleNoClc)
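As a rough illustration of offset-based A addressing with bounds checking, the following plain-Python sketch enumerates output tiles for a grouped problem. It is not the kernel's GroupedWorkIterator1D1D: the function name `enumerate_tiles`, the prefix-sum array `group_offsets`, and the tile sizes are all hypothetical, and the sketch assumes each group owns a contiguous range of rows in a concatenated A tensor.

```python
def enumerate_tiles(group_offsets, n, block_m, block_n):
    """Return a list of (group, row_start, row_end, col_start, col_end) work items.

    Assumption: group g owns rows [group_offsets[g], group_offsets[g + 1])
    of a concatenated A tensor, and every group shares the same N.
    """
    tiles = []
    for g in range(len(group_offsets) - 1):
        start, end = group_offsets[g], group_offsets[g + 1]
        # Empty groups (start == end) produce no tiles.
        for row in range(start, end, block_m):
            for col in range(0, n, block_n):
                # Bounds check: clamp each tile so it never reads or writes
                # rows belonging to the next group, or columns past N.
                row_end = min(row + block_m, end)
                col_end = min(col + block_n, n)
                tiles.append((g, row, row_end, col, col_end))
    return tiles


if __name__ == "__main__":
    # Three groups with 3, 0, and 5 rows; N = 4; 2 x 4 tiles.
    for tile in enumerate_tiles([0, 3, 3, 8], n=4, block_m=2, block_n=4):
        print(tile)
```

The clamping step stands in for the bounds-checked output mentioned above: a partial tile at the end of a group covers only the rows that group owns.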

Architecture:

  • TMA warp: Loads A, B, and A-scale tiles using grid-constant TMAs
  • MMA warp: Standard MMA (partial results written to TMEM, with init_c=True on every K iteration)
  • Epilogue warps: Per-K TMEM read → scale → register accumulate → final output with bounds checking
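The per-K flow above (fresh partial product, then scale, then accumulate) can be modeled on the host in a few lines of plain Python. This is a numerical sketch only, not the kernel or BlockwiseFP8Accumulator: the function name, the N x K layout of B, and the scale granularity (one A scale per row per K block, one B scale per N block per K block) are assumptions made for illustration.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Compute C[i][j] = sum over K blocks of (partial * a_scale * b_scale).

    Assumed shapes (lists of lists):
      a:        M x K
      b:        N x K
      a_scales: M x ceil(K / block_k)
      b_scales: ceil(N / block_n) x ceil(K / block_k)
    """
    m, k, n = len(a), len(a[0]), len(b)
    acc = [[0.0] * n for _ in range(m)]  # plays the role of the register accumulator
    for kb, k0 in enumerate(range(0, k, block_k)):
        k1 = min(k0 + block_k, k)
        for i in range(m):
            for j in range(n):
                # "MMA" step: a fresh partial product for this K block only.
                # Nothing carries over from the previous block, which mirrors
                # init_c=True on every K iteration.
                partial = sum(a[i][kk] * b[j][kk] for kk in range(k0, k1))
                # "Epilogue" step: scale the partial, then accumulate.
                acc[i][j] += partial * a_scales[i][kb] * b_scales[j // block_n][kb]
    return acc


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0]]
    b = [[1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 0.0, 1.0]]
    a_scales = [[0.5, 2.0]]
    b_scales = [[1.0, 1.0], [2.0, 0.5]]
    print(blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k=2, block_n=1))
```

Because the scales differ per K block, they cannot be applied once at the end; each partial result has to be scaled before it is added to the running sum, which is why the accumulation lives outside the MMA step in this model.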

Structs
