Mojo module

blockwise_fp8_smem

Shared memory layout for blockwise FP8 SM100 matmul.

This module provides the SMEM struct for blockwise FP8 matmul kernels where:

  • A-scales are loaded via TMA and stored in SMEM (1D: 1 x BM per stage)
  • B-scales are read directly from global memory (not stored in SMEM)
  • Scaling is applied post-MMA in CUDA cores, not within the MMA unit

Unlike block-scaled matmul, blockwise FP8 accumulates in registers across K iterations and applies the scales once per iteration.
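The accumulation scheme described above can be sketched as a plain-Python reference model. This is not the kernel code and uses none of the module's API. The function name `blockwise_scaled_matmul`, the block sizes, and the scale shapes are all assumptions made for illustration: one A-scale per row per K block, and one B-scale per K block per N block. The model shows only the arithmetic order, in which an unscaled partial product is computed for each K block, scaled, and then added to a persistent accumulator.

```python
# Reference model only (plain Python, no GPU). It mirrors the arithmetic
# order described above: unscaled partial product per K block (the MMA
# step), then scaling, then accumulation. Shapes and names are assumed.

def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Return C where each K block's partial product is scaled before accumulation.

    a:        M x K nested list (stand-in for quantized FP8 values)
    b:        K x N nested list
    a_scales: (K // block_k) x M, one scale per row per K block (assumed shape)
    b_scales: (K // block_k) x (N // block_n), one scale per K block
              per N block (assumed shape)
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    assert k_dim % block_k == 0 and n_dim % block_n == 0

    # Accumulator that persists across K iterations (held in registers
    # in the real kernel).
    acc = [[0.0] * n_dim for _ in range(m_dim)]

    for kb in range(k_dim // block_k):
        k0 = kb * block_k
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial product for this K block.
                partial = sum(a[m][k] * b[k][n] for k in range(k0, k0 + block_k))
                # Scales are applied after the partial product, once per K block.
                acc[m][n] += partial * a_scales[kb][m] * b_scales[kb][n // block_n]
    return acc
```

With every scale set to 1.0 the function reduces to an ordinary matrix product, which gives a simple way to sanity-check it. Because the scales differ between K blocks, they cannot be factored out and applied once at the end, which is why the partial product is scaled on every iteration.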

The tile storage, derived constants, layouts, and accessors are factored into BlockwiseFP8TileCore and shared with BlockwiseFP8_1D2DSmem. Each SMEM struct is a thin wrapper that adds the appropriate pipeline bundle.

Structs