Mojo module

grouped_1d1d_smem

Shared memory layout for grouped 1D-1D block-scaled SM100 matmul.

This is a simplified shared memory (SMEM) structure for the 1D-1D kernel variant, which uses offset-based addressing instead of a pointer per group. Key differences from the standard GroupedBlockScaledSmem:

  1. No tensormap descriptors - the TMA descriptors are grid-constant (not updated per group)
  2. No CLC pipeline storage - uses 3-warp specialization (no scheduler warp)
  3. Simpler barrier structure optimized for the 1D-1D workload

The 1D-1D layout uses:

  • A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
  • B tensor: Batched (num_experts, N, K) weights
  • C tensor: Contiguous (total_tokens, N) output
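
The offset-based addressing described above can be illustrated with a small host-side reference in plain Python. This is only a sketch of the indexing scheme, not the kernel: it assumes that a_offsets holds num_experts + 1 cumulative row offsets (so group g owns rows a_offsets[g] up to, but not including, a_offsets[g + 1]), it assumes each B[g] is multiplied as A_g x B[g]^T because B is stored as (N, K), and it leaves out block scaling entirely. The function name grouped_matmul_1d1d is hypothetical.

```python
def grouped_matmul_1d1d(a, b, a_offsets):
    """Reference grouped matmul over contiguous tokens.

    a:         total_tokens rows of length K, contiguous across all groups
    b:         num_experts matrices, each with N rows of length K
    a_offsets: ASSUMED layout: num_experts + 1 cumulative row offsets, so
               group g owns rows a_offsets[g] .. a_offsets[g + 1] - 1
    """
    n = len(b[0])
    # Output is (total_tokens, N) and shares A's row order.
    c = [[0.0] * n for _ in a]
    for g, weights in enumerate(b):
        start, end = a_offsets[g], a_offsets[g + 1]
        # One base tensor plus offsets replaces a separate pointer per group.
        for row in range(start, end):
            for col in range(n):
                # B[g] is (N, K), so each output column is a dot product
                # of an A row with a B[g] row (A_g x B[g]^T).
                c[row][col] = sum(x * w for x, w in zip(a[row], weights[col]))
    return c


# 3 tokens, K = 2, N = 2, 2 experts.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
]
a_offsets = [0, 1, 3]  # expert 0 owns row 0, expert 1 owns rows 1 and 2
c = grouped_matmul_1d1d(a, b, a_offsets)
```

Under these assumptions a group with no tokens has equal start and end offsets, so its loop body never runs and no special case is needed.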

Structs
