Mojo module

blockwise_fp8_matmul_kernel

Blockwise FP8 SM100 matmul kernel - Structured kernel with register accumulation.

Unlike the standard SM100 matmul, which accumulates in TMEM, the blockwise FP8 kernel applies scaling factors on every K iteration in CUDA cores and accumulates the scaled partial results in registers.
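
The per-K scaling can be written as a formula. This is a simplified sketch that assumes one scalar scale per operand per K block; the symbols $K_b$, $s_A^{(k)}$, and $s_B^{(k)}$ are illustrative names that do not come from the module, and the kernel's actual scale granularity may be finer.

```latex
C \;=\; \sum_{k=0}^{K_b - 1} s_A^{(k)} \, s_B^{(k)} \, \bigl( A^{(k)} B^{(k)} \bigr)
```

Here $A^{(k)} B^{(k)}$ is the unscaled partial product for K block $k$. Because the scales change with $k$, they cannot be factored out of the sum, which is presumably why each partial result is scaled before it is added to the running total instead of being summed unscaled.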

Architecture:

  • Load warp: TMA loads A, B, and A-scales into SMEM
  • MMA warp: Standard MMA operations (partial results to TMEM)
  • Epilogue warp: Per-K TMEM read → scale → register accumulate → final output
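
The data flow above can be mimicked by a small host-side reference model. The following is a minimal sketch in plain Python, not the kernel's implementation: the names `matmul` and `blockwise_scaled_matmul` are hypothetical, and it assumes a single scalar scale per operand per K block, which is likely coarser than what the kernel uses.

```python
def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def blockwise_scaled_matmul(a_blocks, b_blocks, a_scales, b_scales):
    """Return sum over k of a_scales[k] * b_scales[k] * (a_blocks[k] @ b_blocks[k]).

    Assumption: one scalar scale per operand per K block (illustrative only).
    """
    m = len(a_blocks[0])
    n = len(b_blocks[0][0])
    # Stands in for the register accumulator kept by the epilogue warp.
    acc = [[0.0] * n for _ in range(m)]
    for a_k, b_k, s_a, s_b in zip(a_blocks, b_blocks, a_scales, b_scales):
        # Stands in for the unscaled partial result the MMA warp leaves in TMEM.
        partial = matmul(a_k, b_k)
        scale = s_a * s_b
        # Per-K step: read the partial result, scale it, add it to the accumulator.
        for i in range(m):
            for j in range(n):
                acc[i][j] += scale * partial[i][j]
    return acc


# Two K blocks of width 2, with M = 1 and N = 1.
a_blocks = [[[1.0, 2.0]], [[3.0, 4.0]]]
b_blocks = [[[1.0], [1.0]], [[2.0], [0.5]]]
result = blockwise_scaled_matmul(a_blocks, b_blocks, [0.5, 2.0], [2.0, 0.25])
print(result)  # [[7.0]]
```

The first block contributes 1.0 × 3.0 and the second 0.5 × 8.0, so the scales have to be applied before the partial results are summed. A model like this can serve as a correctness reference when checking kernel output on small inputs.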

Key differences from standard/block-scaled kernels:

  • Uses MmaOpSM100_SS (not block-scaled MMA)
  • A-scales loaded via TMA, B-scales from global memory
  • BlockwiseFP8Accumulator for register-based K-loop accumulation
  • BlockwiseFP8TileWriter for final register → SMEM → GMEM flow

comptime values

UnsafePointer

comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]

Structs
