Mojo module

blockwise_fp8_accumulator

Register-based accumulator for blockwise FP8 matmul.

Unlike the standard SM100 matmul, which accumulates directly in tensor memory (TMEM), blockwise FP8 requires per-K-iteration scaling on the CUDA cores:

for k in K_iterations:
    partial = tmem_load()                 # MMA partial result for this K step
    scaled = partial * a_scale * b_scale  # apply per-block scales on the CUDA cores
    accum += scaled                       # accumulate in registers
result = accum                            # write registers → SMEM → GMEM
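
As a concrete illustration of this loop, here is a minimal host-side Mojo sketch using plain Float32 stand-ins for the TMEM partials and per-block scales. The function blockwise_accumulate and its argument names are hypothetical, invented for this example and not part of this module's API; the real kernel keeps accum in GPU registers rather than in a scalar variable.

# Minimal sketch: scalar Float32 stand-ins for TMEM partials and scales.
# `blockwise_accumulate`, `partials`, `a_scales`, and `b_scales` are
# illustrative names only, not this module's API.
fn blockwise_accumulate(
    partials: List[Float32],  # one MMA partial result per K iteration
    a_scales: List[Float32],  # per-block scale for A at each K step
    b_scales: List[Float32],  # per-block scale for B at each K step
) -> Float32:
    var accum: Float32 = 0.0
    for k in range(len(partials)):
        # Scale each partial before adding it, mirroring the
        # per-K-iteration scaling on the CUDA cores shown above.
        accum += partials[k] * a_scales[k] * b_scales[k]
    return accum

fn main():
    var partials = List[Float32](1.0, 2.0, 3.0)
    var a_scales = List[Float32](0.5, 0.5, 0.5)
    var b_scales = List[Float32](2.0, 2.0, 2.0)
    # (1*0.5*2) + (2*0.5*2) + (3*0.5*2) = 6.0
    print(blockwise_accumulate(partials, a_scales, b_scales))

The property the sketch preserves is that each partial is multiplied by its own per-block scales before it joins the running sum, rather than being accumulated first and scaled once at the end, which is what distinguishes this path from the TMEM-resident accumulation of the standard SM100 matmul.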

Structs

Functions
