Mojo module
blockwise_fp8_smem
Shared memory layout for blockwise FP8 SM100 matmul.
This module provides the SMEM struct for blockwise FP8 matmul kernels where:
- A-scales are loaded via TMA and stored in SMEM (1D: 1 x BM per stage)
- B-scales are read directly from global memory (not stored in SMEM)
- Scaling is applied post-MMA in CUDA cores, not within the MMA unit
Unlike block-scaled matmul, blockwise FP8 uses register-based accumulation across K iterations, with scales applied per iteration (sketched below).
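To make the per-iteration scaling concrete, here is a minimal CPU-side Mojo sketch of a blockwise-scaled dot product. The names (`BK`, `blockwise_scaled_dot`, the scale lists) are illustrative assumptions, not part of this module: the K dimension is split into scale blocks, each block's partial product is computed first, then multiplied by that block's A- and B-scales before being added to a float32 accumulator.

```mojo
from collections import List

alias BK = 128  # Scale-block size along K (illustrative; not taken from the module).


fn blockwise_scaled_dot(
    a: List[Float32],         # one row of A (values already dequantized to f32)
    b: List[Float32],         # one column of B
    a_scales: List[Float32],  # one A-scale per K block
    b_scales: List[Float32],  # one B-scale per K block
) -> Float32:
    # Register-style accumulator carried across all K iterations.
    var acc: Float32 = 0.0
    var num_blocks = len(a) // BK
    for blk in range(num_blocks):
        # Partial product for this K block (stands in for one MMA issue).
        var partial: Float32 = 0.0
        for k in range(BK):
            partial += a[blk * BK + k] * b[blk * BK + k]
        # Post-"MMA" step: apply this block's A- and B-scales, then accumulate.
        acc += partial * a_scales[blk] * b_scales[blk]
    return acc
```

The kernel follows the same accumulation pattern in registers: the per-block partial comes from the FP8 MMA, and the multiply by the scales happens afterwards in CUDA cores, so the scale factors never pass through the MMA unit itself.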
Structs
- BlockwiseFP8Smem: SMEM struct for blockwise FP8 matmul, holding the A/B tiles, A-scales, C output, and barriers.
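As a rough mental model only, the sketch below lays out the kind of per-stage storage such a struct describes. The field names, element types, and plain `InlineArray` fields are assumptions for illustration; the actual `BlockwiseFP8Smem` is built from the library's layout-tensor and barrier types and lives in GPU shared memory.

```mojo
from collections import InlineArray


# Illustrative only: UInt8 stands in for the FP8 element type, Float32 for the
# output element type, and the pipeline barriers held by the real struct are
# omitted here.
struct BlockwiseFP8SmemSketch[BM: Int, BN: Int, BK: Int, num_stages: Int]:
    # FP8 A and B input tiles, one pair per pipeline stage.
    var a_tiles: InlineArray[UInt8, BM * BK * num_stages]
    var b_tiles: InlineArray[UInt8, BN * BK * num_stages]
    # A-scales loaded via TMA: a 1 x BM vector per stage.
    # (B-scales are read directly from global memory, so they get no SMEM slot.)
    var a_scales: InlineArray[Float32, BM * num_stages]
    # Staging tile for the C output.
    var c_tile: InlineArray[Float32, BM * BN]
```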