Mojo module
blockwise_fp8_matmul_kernel
Blockwise FP8 SM100 matmul kernel - Structured kernel with register accumulation.
Unlike the standard SM100 matmul, which accumulates in TMEM, the blockwise FP8 kernel applies scaling factors on each K iteration in CUDA cores and accumulates the scaled results in registers.
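As a rough illustration of why the scaling has to happen per K iteration, the computation can be written as below. The scale granularity is an assumption made for this sketch (one A scale per output row and K block, one B scale per K block and output column); the symbols $s^A$, $s^B$, and $k_b$ are not taken from the module itself.

```latex
C_{mn} \;=\; \sum_{k_b} s^{A}_{m,k_b}\, s^{B}_{k_b,n}
\left( \sum_{t \,\in\, \text{block } k_b} A_{mt}\, B_{tn} \right)
```

Because the scale product changes with $k_b$, it cannot be factored out of the outer sum, so each unscaled partial product has to be scaled before it is added to the running total.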
Architecture:
- Load warp: TMA loads A, B, and A-scales into SMEM
- MMA warp: Standard MMA operations (partial results to TMEM)
- Epilogue warp: Per-K TMEM read → scale → register accumulate → final output
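The epilogue flow above (per-K read, scale, accumulate) can be sketched numerically in plain Python. This is a reference model under stated assumptions, not the kernel's implementation: the function name `blockwise_scaled_matmul`, the list-of-lists layout, and the scale shapes (one A scale per row and K block, one B scale per K block and column) are all hypothetical, and the inputs are ordinary floats rather than FP8 values.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k):
    """Reference model of blockwise-scaled accumulation (hypothetical helper).

    a:        M x K matrix (list of rows)
    b:        K x N matrix (list of rows)
    a_scales: M x (K // block_k), assumed one scale per row per K block
    b_scales: (K // block_k) x N, assumed one scale per K block per column
    """
    m, k, n = len(a), len(b), len(b[0])
    # Stands in for the register accumulators held by the epilogue warp.
    acc = [[0.0] * n for _ in range(m)]

    for kb in range(k // block_k):
        k0 = kb * block_k
        # "MMA" step: unscaled partial product for this K block
        # (the role TMEM plays in the description above).
        partial = [
            [sum(a[i][k0 + t] * b[k0 + t][j] for t in range(block_k))
             for j in range(n)]
            for i in range(m)
        ]
        # "Epilogue" step: scale the partial result, then accumulate.
        for i in range(m):
            for j in range(n):
                acc[i][j] += partial[i][j] * a_scales[i][kb] * b_scales[kb][j]
    return acc
```

Called with a 2x4 `a`, a 4x2 `b`, and `block_k=2`, the loop runs twice, and each pass adds one scaled partial tile to `acc`. Summing the unscaled partials first and scaling once at the end would give a different result whenever the scales differ between K blocks.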
Key differences from standard/block-scaled kernels:
- Uses MmaOpSM100_SS (not block-scaled MMA)
- A-scales loaded via TMA, B-scales from global memory
- BlockwiseFP8Accumulator for register-based K-loop accumulation
- BlockwiseFP8TileWriter for final register → SMEM → GMEM flow
comptime values
UnsafePointer
comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]
Structs
- BlackwellBlockwiseFP8MatmulKernel: Blockwise FP8 matmul kernel with register-based accumulation.