
Mojo module

blockwise_fp8_1d2d_smem

Shared memory layout for blockwise FP8 1D2D SM100 matmul.

This is a simplified shared memory (SMEM) structure for the 1D2D blockwise FP8 kernel that uses offset-based addressing (GroupedWorkIterator1D1D). Compared with the standard BlockwiseFP8Smem:

  1. No CLC pipeline storage: the kernel uses 3-warp specialization, with no scheduler warp
  2. Uses SmemPipelineBundleNoClc instead of SmemPipelineBundle
  3. Tile storage (A, B, C, A-scales) is otherwise identical

The 1D-1D layout uses:

  • A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
  • B tensor: Batched (num_experts * N, K) weights
  • C tensor: Contiguous (total_tokens, N) output
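
The tensor shapes listed above imply a simple indexing scheme, sketched here in plain Python as a host-side reference rather than as kernel code. The sketch rests on several assumptions that this page does not confirm: `a_offsets` is a prefix-sum array with one more entry than there are groups, group `g` uses expert `g` (a real kernel may go through an expert-ID lookup instead), and each group computes `C = A @ B^T` because B is stored as (N, K). The function names are hypothetical and are not part of the module.

```python
def group_row_range(a_offsets, group):
    """Rows of A (and C) owned by `group`.

    Assumption: a_offsets is a prefix sum with num_groups + 1 entries.
    """
    return a_offsets[group], a_offsets[group + 1]


def expert_row_range(expert, n):
    """Rows of the batched (num_experts * N, K) B tensor holding `expert`."""
    return expert * n, (expert + 1) * n


def grouped_matmul_reference(a, b, a_offsets, n):
    """Unoptimized reference: per group, C[rows] = A[rows] @ B_expert^T.

    Assumption: group index == expert index (hypothetical mapping).
    """
    k = len(a[0]) if a else 0
    c = [[0.0] * n for _ in range(len(a))]
    for group in range(len(a_offsets) - 1):
        row_start, row_end = group_row_range(a_offsets, group)
        b_start, _ = expert_row_range(group, n)
        for m in range(row_start, row_end):
            for col in range(n):
                # B is (N, K), so output column `col` reads B row `b_start + col`.
                c[m][col] = sum(a[m][kk] * b[b_start + col][kk] for kk in range(k))
    return c
```

Because A and C share the same token dimension, a single offsets array is enough to locate both the input rows and the output rows of a group; only B needs a separate per-expert stride of N rows.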

Structs
