
Mojo module

blockwise_fp8_accumulator

Register-based accumulator for blockwise FP8 matmul.

Unlike standard SM100 matmul, which accumulates directly in TMEM, blockwise FP8 matmul requires the partial result of each K iteration to be scaled on the CUDA cores before it is accumulated:

accum = 0  # register accumulator, zeroed before the K loop
for k in K_iterations:
    partial = TMEM load (MMA result)
    scaled = partial * a_scale * b_scale
    accum += scaled  # in registers
result = accum  # write to SMEM → GMEM
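
To make the arithmetic concrete, the loop above can be modelled on the CPU. The following is a minimal sketch in plain Python, not the module's Mojo implementation: the function names, the scale layout (`a_scale[m][kb]` per row and K-block, `b_scale[kb][n]` per K-block and column), and the use of ordinary floats in place of FP8 values are all assumptions made for illustration. It shows that scaling each K-block's partial product before accumulating gives the same result as dequantizing every element first.

```python
def blockwise_scaled_matmul(a_q, b_q, a_scale, b_scale, block_k):
    """Reference model of per-K-iteration scaling (hypothetical helper).

    a_q is M x K and b_q is K x N, holding quantized values as plain numbers.
    a_scale[m][kb] and b_scale[kb][n] hold one scale per K-block; this
    layout is an assumption, not taken from the module's documentation.
    """
    m_dim, k_dim, n_dim = len(a_q), len(b_q), len(b_q[0])
    assert k_dim % block_k == 0, "K must be a multiple of block_k"
    accum = [[0.0] * n_dim for _ in range(m_dim)]
    for kb in range(k_dim // block_k):
        k0 = kb * block_k
        for m in range(m_dim):
            for n in range(n_dim):
                # Stand-in for the MMA result that the kernel loads from TMEM.
                partial = sum(a_q[m][k] * b_q[k][n] for k in range(k0, k0 + block_k))
                # Scale this K-block's partial result, then accumulate it.
                accum[m][n] += partial * a_scale[m][kb] * b_scale[kb][n]
    return accum


def dequantized_matmul(a_q, b_q, a_scale, b_scale, block_k):
    """Cross-check: apply the scales element by element, then multiply."""
    m_dim, k_dim, n_dim = len(a_q), len(b_q), len(b_q[0])
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            for k in range(k_dim):
                kb = k // block_k
                out[m][n] += (a_q[m][k] * a_scale[m][kb]) * (b_q[k][n] * b_scale[kb][n])
    return out


# Small integers and power-of-two scales keep every operation exact in floats.
a_q = [[1, 2, 3, 4], [-1, 0, 2, 1]]
b_q = [[1, 0], [0, 1], [2, -1], [1, 1]]
a_scale = [[0.5, 2.0], [1.0, 0.25]]   # [row][K-block]
b_scale = [[2.0, 1.0], [0.5, 4.0]]    # [K-block][column]

result = blockwise_scaled_matmul(a_q, b_q, a_scale, b_scale, block_k=2)
reference = dequantized_matmul(a_q, b_q, a_scale, b_scale, block_k=2)
print(result)  # [[11.0, 9.0], [-1.375, -1.0]]
```

Because the scales change from one K-block to the next, they cannot be factored out of the whole K sum, so the scaling cannot be deferred to a single step after the loop. This is why each partial result is scaled before it is added to the running total.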

Structs​

Functions​