Mojo module
blockwise_fp8_accumulator
Register-based accumulator for blockwise FP8 matmul.
Unlike standard SM100 matmul, which accumulates directly in TMEM, blockwise FP8 requires per-K-iteration scaling in CUDA cores:

```
for k in K_iterations:
    partial = TMEM load (MMA result)
    scaled = partial * a_scale * b_scale
    accum += scaled   # in registers
result = accum        # write to SMEM → GMEM
```

Structs

- `BlockwiseFP8Accumulator`: Register-based accumulator for blockwise FP8 matmul.
Functions

- `get_accumulator_layout`: Compute the register accumulator layout for blockwise FP8.
- `is_lower_fragment_required`: Determine if the lower TMEM fragment is needed based on the config.
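The per-K-iteration scaling loop above can be sketched in plain Python (this is an illustrative model, not the Mojo kernel; `K_BLOCK`, `blockwise_scaled_dot`, and the scale arguments are hypothetical names, and the "TMEM load" is simulated as an ordinary partial dot product):

```python
K_BLOCK = 4  # assumed number of K elements covered by one MMA / scale pair

def blockwise_scaled_dot(a, b, a_scales, b_scales):
    """Dot product where each K block's partial result is multiplied by
    that block's a_scale * b_scale before being accumulated, mirroring
    the register-based accumulation described above."""
    accum = 0.0
    for k, (a_s, b_s) in enumerate(zip(a_scales, b_scales)):
        start = k * K_BLOCK
        # partial plays the role of the TMEM load (MMA result) for block k
        partial = sum(x * y for x, y in zip(a[start:start + K_BLOCK],
                                            b[start:start + K_BLOCK]))
        accum += partial * a_s * b_s  # scaled, then accumulated "in registers"
    return accum
```

With all scales set to 1.0 this reduces to an ordinary dot product; non-unit scales show how each K block's contribution is rescaled independently before accumulation.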