Mojo module

blockwise_fp8_matmul_kernel

Blockwise FP8 SM100 matmul kernel - Structured kernel with register accumulation.

Unlike the standard SM100 matmul, which accumulates in TMEM, the blockwise FP8 kernel applies scaling factors at each K iteration on the CUDA cores and accumulates the scaled results in registers.

Architecture:

Load warp: TMA loads A, B, and A-scales into SMEM
MMA warp: Standard MMA operations (partial results to TMEM)
Epilogue warp: Per-K TMEM read → scale → register accumulate → final output
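
The per-K flow of the epilogue warp can be modeled numerically on the CPU. The following is a minimal Python sketch of the arithmetic only, not the kernel's implementation: the function name, the argument names, and the scale layouts (one A-scale per row and K-block, one B-scale per K-block and N-block) are assumptions made for illustration and are not taken from this module.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Hypothetical reference model of blockwise-scaled accumulation.

    a:        M x K nested list (quantized values held as floats)
    b:        K x N nested list
    a_scales: a_scales[m][kb], scale for row m and K-block kb (assumed layout)
    b_scales: b_scales[kb][nb], scale for K-block kb and N-block nb (assumed layout)
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    assert k_dim % block_k == 0, "K must be a multiple of block_k"

    # Models the register accumulator that lives across the whole K loop.
    acc = [[0.0] * n_dim for _ in range(m_dim)]

    for kb in range(k_dim // block_k):
        k0 = kb * block_k
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial product for this K-block
                # (models the MMA result that is read back from TMEM).
                partial = sum(a[m][k] * b[k][n] for k in range(k0, k0 + block_k))
                # Apply the scales for this K-block, then accumulate.
                scale = a_scales[m][kb] * b_scales[kb][n // block_n]
                acc[m][n] += partial * scale
    return acc


# One output element, two K-blocks of width 2:
# (1 + 2) * (0.5 * 1.0) + (3 + 4) * (2.0 * 3.0) = 1.5 + 42.0
result = blockwise_scaled_matmul(
    [[1.0, 2.0, 3.0, 4.0]],
    [[1.0], [1.0], [1.0], [1.0]],
    a_scales=[[0.5, 2.0]],
    b_scales=[[1.0], [3.0]],
    block_k=2,
    block_n=1,
)
print(result)  # [[43.5]]
```

Because the scale changes from one K-block to the next, each partial product has to be scaled before it is added to the running sum. This is why the accumulation cannot simply be left to the MMA unit in TMEM.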

Key differences from standard/block-scaled kernels:

Uses MmaOpSM100_SS (not block-scaled MMA)
A-scales are loaded via TMA; B-scales are read from global memory
BlockwiseFP8Accumulator for register-based K-loop accumulation
BlockwiseFP8TileWriter for final register → SMEM → GMEM flow

`comptime` values

`UnsafePointer`

comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]

Structs

BlackwellBlockwiseFP8MatmulKernel: Blockwise FP8 matmul kernel with register-based accumulation.
