Mojo module

blockwise_fp8_matmul_kernel

Blockwise FP8 SM100 matmul kernel - Structured kernel with register accumulation.

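Unlike the standard SM100 matmul, which accumulates in TMEM, the blockwise FP8 kernel applies scaling factors on every K iteration in CUDA cores and accumulates the scaled results in registers. Because the scaling factors change from one K iteration to the next, each partial product has to be scaled before it is added to the running sum.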

Architecture:

Load warp: TMA loads A, B, and A-scales into SMEM
MMA warp: Standard MMA operations (partial results to TMEM)
Epilogue warp: Per-K TMEM read → scale → register accumulate → final output
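
The per-K flow in the epilogue warp (TMEM read → scale → register accumulate) can be sketched numerically. What follows is a minimal pure-Python reference model, not the Mojo kernel: the function name, the argument names, and the scale layouts (one A-scale per row per K block, one B-scale per N block per K block) are assumptions made for illustration and are not taken from the module's API. It assumes K is a multiple of the K block size.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Reference model of blockwise-scaled matmul (hypothetical helper).

    a:        M x K matrix (list of rows)
    b:        K x N matrix (list of rows)
    a_scales: M x (K // block_k), assumed layout: one scale per row per K block
    b_scales: (N // block_n) x (K // block_k), assumed layout: one scale per
              N block per K block
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    assert k_dim % block_k == 0, "sketch assumes K is a multiple of block_k"

    # Stands in for the register accumulators held by the epilogue warp.
    acc = [[0.0] * n_dim for _ in range(m_dim)]

    for kb in range(k_dim // block_k):  # one pass per K iteration
        k0 = kb * block_k
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial product for this K block. In the kernel this
                # is what the MMA warp produces and the epilogue reads back.
                partial = sum(a[m][k] * b[k][n] for k in range(k0, k0 + block_k))
                # Scale the partial result first, then accumulate it.
                scale = a_scales[m][kb] * b_scales[n // block_n][kb]
                acc[m][n] += partial * scale
    return acc


# Tiny example: M=2, K=4, N=2 with two K blocks and a single N block.
a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
a_scales = [[0.5, 2.0], [1.0, 1.0]]
b_scales = [[2.0, 0.5]]
result = blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k=2, block_n=2)
print(result)
```

With all scales set to 1.0 the model reduces to an ordinary matmul, which is a convenient sanity check. The point of the sketch is the ordering inside the K loop: the scale is applied to each block's partial product before the addition, so the running sum cannot simply be left unscaled and corrected once at the end.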

Key differences from standard/block-scaled kernels:

Uses MmaOpSM100_SS (not block-scaled MMA)
A-scales loaded via TMA, B-scales from global memory
BlockwiseFP8Accumulator for register-based K-loop accumulation
BlockwiseFP8TileWriter for final register → SMEM → GMEM flow

Structs

BlackwellBlockwiseFP8MatmulKernel: Blockwise FP8 matmul kernel with register-based accumulation.
