Mojo package
grouped_block_scaled_1d1d
Grouped block-scaled matmul with 1D-1D tensor layout for SM100.
This module provides a structured kernel implementation for grouped GEMM operations in Mixture of Experts (MoE) layers, using contiguous token buffers with offset-based addressing (the "1D-1D" layout).
Key characteristics (a reference sketch follows this list):
- A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
- B tensor: Batched (num_experts, N, K) weights
- C tensor: Contiguous (total_tokens, N) output
- Per-expert output scaling via expert_scales tensor
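The following is a minimal, unoptimized reference sketch of what the 1D-1D addressing means, not the kernel itself: A and C are flat row-major token buffers, a_offsets delimits each expert's token range, B is indexed per expert, and each expert's output is scaled by its entry in expert_scales. The function name and List-based signature are illustrative assumptions, not the module's API.

```mojo
# Reference semantics of the grouped 1D-1D layout (illustrative only; the real
# kernel operates on device tensors, block-scaled inputs, and SM100 MMA tiles).
fn grouped_matmul_1d1d_reference(
    a: List[Float32],             # (total_tokens, K), row-major, all groups concatenated
    b: List[Float32],             # (num_experts, N, K), row-major expert weights
    a_offsets: List[Int],         # length num_experts + 1; expert e owns tokens [a_offsets[e], a_offsets[e + 1])
    expert_scales: List[Float32], # one output scale per expert
    K: Int,
    N: Int,
) -> List[Float32]:
    var total_tokens = a_offsets[len(a_offsets) - 1]
    var num_experts = len(a_offsets) - 1

    # C is a flat (total_tokens, N) buffer, laid out like A.
    var c = List[Float32]()
    for _ in range(total_tokens * N):
        c.append(0.0)

    for e in range(num_experts):
        for t in range(a_offsets[e], a_offsets[e + 1]):
            for n in range(N):
                var acc: Float32 = 0.0
                for k in range(K):
                    # Row t of A against row n of expert e's (N, K) weight matrix.
                    acc += a[t * K + k] * b[(e * N + n) * K + k]
                # Per-expert output scaling applied on the way out.
                c[t * N + n] = acc * expert_scales[e]
    return c
```

Compared with padding every expert to a fixed token count, the offset-based layout keeps A and C dense when experts receive uneven numbers of tokens; only a_offsets changes between routing decisions.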
This is a port of max/kernels/src/linalg/grouped_matmul_sm100_1d1d.mojo to the structured kernels architecture.
See PORTING_PLAN.md for implementation details.
Modules
- grouped_1d1d_matmul: CPU entrypoint for grouped 1D-1D block-scaled SM100 matmul.
- grouped_1d1d_matmul_kernel: Grouped 1D-1D block-scaled SM100 matmul kernel.
- grouped_1d1d_smem: Shared memory layout for grouped 1D-1D block-scaled SM100 matmul.
- grouped_1d1d_tile_scheduler: Work scheduler for grouped 1D-1D block-scaled SM100 matmul (sketched below).
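As a rough illustration of what a grouped work scheduler has to do, the sketch below enumerates every expert's output-tile grid as one flat sequence of work items, with per-expert tile counts derived from the a_offsets token ranges. This is not the grouped_1d1d_tile_scheduler API; the tile sizes, offsets, and variable names are made up for the example.

```mojo
fn main():
    # Hypothetical sizes: 3 experts owning 5, 0, and 9 tokens, BM = 4 output
    # rows per tile, and 2 tiles along N.
    var a_offsets = List[Int]()
    a_offsets.append(0)
    a_offsets.append(5)
    a_offsets.append(5)
    a_offsets.append(14)
    var BM = 4
    var num_n_tiles = 2

    # Flatten every expert's (m_tiles x n_tiles) grid into one linear sequence
    # of work items; this is the order in which output tiles would be handed out.
    var work_idx = 0
    for e in range(len(a_offsets) - 1):
        var tokens = a_offsets[e + 1] - a_offsets[e]
        var m_tiles = (tokens + BM - 1) // BM  # empty experts contribute no work
        for m in range(m_tiles):
            for n in range(num_n_tiles):
                print("work", work_idx, "-> expert", e, "m_tile", m, "n_tile", n)
                work_idx += 1
```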