Mojo module
blockwise_fp8_accumulator
Register-based accumulator for blockwise FP8 matmul.
Unlike standard SM100 matmul, which accumulates directly in TMEM, blockwise FP8 requires per-K-iteration scaling in CUDA cores:

```
for k in K_iterations:
    partial = TMEM load (MMA result)
    scaled = partial * a_scale * b_scale
    accum += scaled   # in registers
result = accum        # write to SMEM → GMEM
```

Structs

- `BlockwiseFP8Accumulator`: Register-based accumulator for blockwise FP8 matmul.
Functions

- `get_accumulator_layout`: Compute the register accumulator layout for blockwise FP8.
- `is_lower_fragment_required`: Determine if the lower TMEM fragment is needed based on the config.
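The per-K-iteration scaling loop above can be sketched in plain Python (this is an illustrative model, not the Mojo kernel; `K_BLOCK`, `blockwise_scaled_dot`, and the scale arguments are hypothetical names, and the "TMEM load" is simulated as an ordinary partial dot product):

```python
K_BLOCK = 4  # assumed number of K elements covered by one MMA / scale pair

def blockwise_scaled_dot(a, b, a_scales, b_scales):
    """Dot product where each K block's partial result is multiplied by
    that block's a_scale * b_scale before being accumulated, mirroring
    the register-based accumulation described above."""
    accum = 0.0
    for k, (a_s, b_s) in enumerate(zip(a_scales, b_scales)):
        start = k * K_BLOCK
        # partial plays the role of the TMEM load (MMA result) for block k
        partial = sum(x * y for x, y in zip(a[start:start + K_BLOCK],
                                            b[start:start + K_BLOCK]))
        accum += partial * a_s * b_s  # scaled, then accumulated "in registers"
    return accum
```

With all scales set to 1.0 this reduces to an ordinary dot product; non-unit scales show how each K block's contribution is rescaled independently before accumulation.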