Mojo module
blockwise_fp8_smem
Shared memory layout for blockwise FP8 SM100 matmul.
This module provides the SMEM struct for blockwise FP8 matmul kernels where:
- A-scales are loaded via TMA and stored in SMEM (1D: 1 x BM per stage)
- B-scales are read directly from global memory (not stored in SMEM)
- Scaling is applied post-MMA in CUDA cores, not within the MMA unit
Unlike block-scaled matmul, blockwise FP8 uses register-based accumulation across K iterations, with scales applied per-iteration.
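The arithmetic behind this scheme can be illustrated with a small host-side reference. The sketch below is plain Python, not the kernel's code. It assumes one A-scale per output row per K-block (consistent with the 1 x BM per stage layout above) and one B-scale per (K-block, N-block) pair. The function names, argument names, and the B-scale granularity are hypothetical choices made for illustration.

```python
def blockwise_scaled_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Accumulate per K-block partial products, scaling each one after the multiply.

    a: M x K and b: K x N hold the quantized values (plain floats here).
    a_scales[kb][m]: assumed one scale per row of A per K-block.
    b_scales[kb][nb]: assumed one scale per (K-block, N-block) of B.
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    # Stands in for the register accumulator that persists across K iterations.
    acc = [[0.0] * n_dim for _ in range(m_dim)]
    for kb in range(k_dim // block_k):
        k0 = kb * block_k
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial product for this K-block (the "MMA" result).
                partial = sum(a[m][k] * b[k][n] for k in range(k0, k0 + block_k))
                # Scales are applied after the multiply, once per K iteration.
                acc[m][n] += partial * a_scales[kb][m] * b_scales[kb][n // block_n]
    return acc


def dequantize_then_matmul(a, b, a_scales, b_scales, block_k, block_n):
    """Reference: apply the scales to every element first, then multiply."""
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            for k in range(k_dim):
                a_deq = a[m][k] * a_scales[k // block_k][m]
                b_deq = b[k][n] * b_scales[k // block_k][n // block_n]
                out[m][n] += a_deq * b_deq
    return out
```

Because both scales are constant within a K-block, factoring them out of the inner sum gives the same result as dequantizing every element first, which is why the scaling can be deferred until after each partial product.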
The tile storage, derived constants, layouts, and accessors are factored into BlockwiseFP8TileCore and shared with BlockwiseFP8_1D2DSmem. Each SMEM struct is a thin wrapper that adds the appropriate pipeline bundle.
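The composition described here can be pictured with a small structural sketch. This is Python rather than Mojo, and every class, field, and method name below is hypothetical. It only illustrates the idea of a shared core wrapped by thin structs that each add their own pipeline bundle; it does not reproduce the module's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class TileCoreSketch:
    """Hypothetical stand-in for the shared core: tile shape plus derived constants."""
    bm: int
    bn: int
    bk: int
    num_stages: int

    def a_scale_elems(self) -> int:
        # One A-scale per row of the BM tile, per pipeline stage (1 x BM per stage).
        return self.num_stages * self.bm

    def a_tile_elems(self) -> int:
        # One BM x BK tile of A per pipeline stage.
        return self.num_stages * self.bm * self.bk


@dataclass
class PipelineBundleSketch:
    """Hypothetical placeholder for a set of pipeline barriers."""
    kind: str


@dataclass
class SmemWithClcSketch:
    """Thin wrapper: shared core plus a CLC-scheduler-style pipeline bundle."""
    core: TileCoreSketch
    pipelines: PipelineBundleSketch = field(
        default_factory=lambda: PipelineBundleSketch("clc"))


@dataclass
class SmemOtherSketch:
    """A second thin wrapper reusing the same core with a different bundle."""
    core: TileCoreSketch
    pipelines: PipelineBundleSketch = field(
        default_factory=lambda: PipelineBundleSketch("other"))
```

Both wrappers delegate all tile sizing to the core, so a change to the tile layout is made in one place and only the pipeline bundle differs between them.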
Structs
- BlockwiseFP8Smem: SMEM struct for blockwise FP8 matmul with CLC scheduler pipeline.
- BlockwiseFP8TileCore: Core tile storage for blockwise FP8 matmul SMEM structs.