Mojo module
grouped_1d1d_smem
Shared memory layout for grouped 1D-1D block-scaled SM100 matmul.
This is a simplified SMEM structure for the 1D-1D kernel variant, which uses offset-based addressing instead of one pointer per group. Key differences from the standard GroupedBlockScaledSmem:
- No tensormap descriptors - TMAs are grid-constant (not updated per-group)
- No CLC pipeline storage - uses 3-warp specialization (no scheduler warp)
- Simpler barrier structure optimized for the 1D-1D workload
The 1D-1D layout uses:
- A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
- B tensor: Batched (num_experts, N, K) weights
- C tensor: Contiguous (total_tokens, N) output
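The offset-based addressing described above can be illustrated with a small host-side reference in plain Python. This is only a sketch under stated assumptions, not the kernel's implementation: it assumes a_offsets is a cumulative (prefix-sum) array with num_experts + 1 entries, that group g uses expert g's weights, and it ignores block scaling entirely. The helper names group_row_ranges and grouped_matmul_reference are hypothetical and do not appear in the module.

```python
def group_row_ranges(a_offsets):
    """Turn cumulative offsets into per-group (start, end) row ranges.

    Assumption: a_offsets[g] is the first token row of group g and
    a_offsets[g + 1] is one past its last row.
    """
    return [(a_offsets[g], a_offsets[g + 1]) for g in range(len(a_offsets) - 1)]


def grouped_matmul_reference(a, b, a_offsets):
    """Unscaled reference for the 1D-1D grouped layout.

    a: (total_tokens, K) as a list of rows
    b: (num_experts, N, K) as a list of per-expert weight matrices
    Returns c: (total_tokens, N), where rows of group g hold A_g @ B[g]^T.
    """
    total_tokens = len(a)
    n = len(b[0])
    c = [[0] * n for _ in range(total_tokens)]
    for g, (start, end) in enumerate(group_row_ranges(a_offsets)):
        # Every group indexes into the same contiguous A and C buffers;
        # only the row range changes, so no per-group pointer is needed.
        for row in range(start, end):
            for col in range(n):
                c[row][col] = sum(x * w for x, w in zip(a[row], b[g][col]))
    return c


# Two experts, three tokens: the first two rows go to expert 0, the last to expert 1.
a = [[1, 2], [3, 4], [5, 6]]
b = [
    [[1, 0], [0, 1]],  # expert 0: identity weights
    [[1, 1], [2, 0]],  # expert 1
]
a_offsets = [0, 2, 3]
c = grouped_matmul_reference(a, b, a_offsets)
```

Because A and C are contiguous and share the same row indexing, a single set of base addresses plus the offsets is enough to locate any group's data, which is consistent with the note above that the TMAs do not need per-group updates.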
Structs
- Grouped1D1DSmem: SMEM struct for grouped 1D-1D block-scaled GEMM.