IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo module

blockwise_fp8_accumulator

Register-based accumulator for blockwise FP8 matmul.

Unlike standard SM100 matmul, which accumulates directly in TMEM, blockwise FP8 requires scaling each K-iteration's partial result on the CUDA cores before it is accumulated:

for k in K_iterations:
    partial = TMEM load (MMA result)
    scaled = partial * a_scale * b_scale
    accum += scaled  # in registers
result = accum  # write to SMEM → GMEM
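
The loop above can be modeled on the CPU to check the arithmetic. The following is a minimal sketch in plain Python, not the Mojo kernel: the function name `blockwise_scaled_matmul`, the `block_k` parameter, and the scale layouts (`a_scales[m][kb]` per row and K-block, `b_scales[kb][n]` per K-block and column) are assumptions made for illustration and are not taken from this module's API.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k):
    """CPU reference for per-K-block scaled accumulation.

    a is M x K and b is K x N, both as lists of lists of floats.
    a_scales[m][kb] and b_scales[kb][n] are hypothetical scale layouts.
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    assert k_dim % block_k == 0, "K must be a multiple of block_k"

    # Stands in for the register accumulator.
    accum = [[0.0] * n_dim for _ in range(m_dim)]

    for kb in range(k_dim // block_k):
        k0 = kb * block_k
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial product for this K-block
                # (stands in for the MMA result loaded from TMEM).
                partial = sum(a[m][k] * b[k][n] for k in range(k0, k0 + block_k))
                # Scale first, then accumulate.
                accum[m][n] += partial * a_scales[m][kb] * b_scales[kb][n]
    return accum


# One output element, two K-blocks of width 2.
result = blockwise_scaled_matmul(
    a=[[1.0, 2.0, 3.0, 4.0]],
    b=[[1.0], [1.0], [1.0], [1.0]],
    a_scales=[[0.5, 2.0]],
    b_scales=[[1.0], [0.25]],
    block_k=2,
)
print(result)  # [[5.0]]
```

Here the first K-block contributes 3 * 0.5 * 1.0 = 1.5 and the second contributes 7 * 2.0 * 0.25 = 3.5. Because the scales differ per K-block, summing the unscaled partials first would give a different answer, which is why the scaling has to happen inside the loop.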

Structs

BlockwiseFP8Accumulator: Register-based accumulator for blockwise FP8 matmul.

Functions

get_accumulator_dims: Compute register accumulator dimensions for blockwise FP8.
is_lower_fragment_required: Determine if lower TMEM fragment is needed based on config.
