Mojo module

blockwise_fp8_matmul_kernel

Blockwise FP8 SM100 matmul kernel - Structured kernel with register accumulation.

Unlike the standard SM100 matmul, which accumulates in TMEM, the blockwise FP8 kernel applies scaling factors on every K iteration in CUDA cores and accumulates the scaled results in registers.

Architecture:

  • Load warp: TMA loads A, B, and A-scales into SMEM
  • MMA warp: Standard MMA operations (partial results to TMEM)
  • Epilogue warp: Per-K TMEM read → scale → register accumulate → final output
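
The numerical effect of this pipeline can be illustrated with a small host-side reference. The following is a minimal sketch in plain Python, not the kernel's Mojo code: the function name `blockwise_scaled_matmul`, its parameters, and the scale layout (one A-scale per row per K-block, one B-scale per K-block per N-block) are assumptions made for illustration. It shows only the ordering described above: an unscaled partial product per K-block (the MMA step), followed by scaling and accumulation (the epilogue step).

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Reference for C = sum over K-blocks of (a_scale * b_scale * partial).

    a: M x K list of lists, b: K x N list of lists.
    a_scales[m][kb] and b_scales[kb][nb] use a hypothetical layout chosen
    for this sketch; the real kernel's scale layout may differ.
    """
    m_dim, k_dim, n_dim = len(a), len(a[0]), len(b[0])
    # Stands in for the per-thread register accumulators.
    acc = [[0.0] * n_dim for _ in range(m_dim)]
    for k_start in range(0, k_dim, block_k):
        k_blk = k_start // block_k
        k_end = min(k_start + block_k, k_dim)
        for m in range(m_dim):
            for n in range(n_dim):
                # "MMA" step: unscaled partial product over one K-block.
                partial = sum(a[m][k] * b[k][n] for k in range(k_start, k_end))
                # "Epilogue" step: scale the partial, then accumulate.
                scale = a_scales[m][k_blk] * b_scales[k_blk][n // block_n]
                acc[m][n] += scale * partial
    return acc


# Tiny example: M=1, K=4, N=1 with two K-blocks of size 2.
a = [[1.0, 2.0, 3.0, 4.0]]
b = [[1.0], [1.0], [1.0], [1.0]]
a_scales = [[0.5, 2.0]]      # one scale per row per K-block
b_scales = [[1.0], [3.0]]    # one scale per K-block per N-block
result = blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k=2, block_n=1)
```

Because each K-block carries its own scale, the partial products cannot simply be summed in one accumulator and scaled once at the end, which is why the scaling happens inside the K loop.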

Key differences from standard/block-scaled kernels:

  • Uses MmaOpSM100_SS (not block-scaled MMA)
  • A-scales loaded via TMA, B-scales from global memory
  • BlockwiseFP8Accumulator for register-based K-loop accumulation
  • BlockwiseFP8TileWriter for final register → SMEM → GMEM flow

Structs