Mojo module
blockwise_fp8_1d2d_smem
Shared memory layout for blockwise FP8 1D2D SM100 matmul.
This is a simplified SMEM structure for the 1D2D blockwise FP8 kernel that uses offset-based addressing (GroupedWorkIterator1D1D). Key differences from the standard BlockwiseFP8Smem:
- No CLC pipeline storage: the kernel uses 3-warp specialization with no scheduler warp
- Uses SmemPipelineBundleNoClc instead of SmemPipelineBundle
- Otherwise identical tile storage (A, B, C, A-scales)
The 1D-1D layout uses:
- A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
- B tensor: Batched (num_experts * N, K) weights
- C tensor: Contiguous (total_tokens, N) output
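To make the offset-based addressing concrete, here is a minimal host-side sketch in plain Python (not Mojo, and not the kernel's actual code). It assumes that a_offsets is a prefix-sum array of length num_groups + 1, so that group g owns rows a_offsets[g] up to a_offsets[g + 1] of A and C, and that group g uses expert g's N rows of B. Both are illustrative assumptions; the page does not specify how groups map to experts. The helper names are hypothetical.

```python
def group_row_range(a_offsets, group):
    # Assumption: a_offsets is a prefix sum, so consecutive entries bound
    # the rows of A (and C) that belong to one group.
    return a_offsets[group], a_offsets[group + 1]


def expert_row_range(expert, n):
    # B is stored as (num_experts * N, K): expert e owns rows [e*N, (e+1)*N).
    return expert * n, (expert + 1) * n


def grouped_matmul_reference(a, b, a_offsets, n):
    """Reference C = A_g @ B_g^T per group, using nested lists.

    a: (total_tokens, K), b: (num_experts * N, K), result: (total_tokens, N).
    Assumption: group g is paired with expert g.
    """
    c = [[0] * n for _ in range(len(a))]
    for group in range(len(a_offsets) - 1):
        row_start, row_end = group_row_range(a_offsets, group)
        b_start, _ = expert_row_range(group, n)
        for t in range(row_start, row_end):
            for j in range(n):
                # Dot product over K between token row t and weight row j.
                c[t][j] = sum(x * w for x, w in zip(a[t], b[b_start + j]))
    return c
```

Because A and C share the same row offsets, a single a_offsets array is enough to locate both the input rows and the output rows of a group; only B needs the separate expert * N stride.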
Structs
- BlockwiseFP8_1D2DSmem: SMEM struct for blockwise FP8 1D2D matmul: A/B tiles, A-scales, C output, barriers.