Mojo package

grouped_block_scaled_1d1d

Grouped block-scaled matmul with 1D-1D tensor layout for SM100.

This module provides a structured kernel implementation for grouped GEMM operations in Mixture of Experts (MoE) layers, using contiguous token buffers with offset-based addressing (the "1D-1D" layout).

Key characteristics:

A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
B tensor: Batched (num_experts, N, K) weights
C tensor: Contiguous (total_tokens, N) output
Per-expert output scaling via expert_scales tensor
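
The layout described above can be illustrated with a small host-side reference. This is only a sketch of the indexing semantics, not the package's kernel or API: the function name `grouped_matmul_1d1d_reference` is hypothetical, `a_offsets` is assumed to hold `num_experts + 1` prefix offsets into the token dimension, `expert_scales` is assumed to be one multiplier per expert applied to that expert's output rows, and the block-scaling factors for A and B are omitted entirely.

```python
def grouped_matmul_1d1d_reference(a, b, a_offsets, expert_scales):
    """Reference semantics for the 1D-1D grouped layout (illustrative only).

    a:             total_tokens rows, each of length K (contiguous token buffer)
    b:             num_experts weight matrices, each N rows of length K
    a_offsets:     ASSUMED to be num_experts + 1 prefix offsets, so expert e
                   owns token rows a_offsets[e] .. a_offsets[e + 1] - 1
    expert_scales: ASSUMED to be one scalar per expert, applied to its outputs
    Returns c with total_tokens rows of length N.
    """
    num_experts = len(b)
    n = len(b[0])
    c = [[0.0] * n for _ in range(len(a))]

    for e in range(num_experts):
        start, end = a_offsets[e], a_offsets[e + 1]
        for t in range(start, end):          # token rows routed to expert e
            for j in range(n):               # output column j uses weight row j
                # B is stored as (N, K), so this is a row-by-row dot product,
                # i.e. C = A @ B[e]^T for the rows in this group.
                acc = sum(x * w for x, w in zip(a[t], b[e][j]))
                c[t][j] = expert_scales[e] * acc
    return c


# Three tokens, two experts, K = 2, N = 3.
a = [[1, 2], [3, 4], [5, 6]]
b = [
    [[1, 0], [0, 1], [1, 1]],    # expert 0 weights, shape (N, K)
    [[2, 0], [0, 2], [1, -1]],   # expert 1 weights, shape (N, K)
]
a_offsets = [0, 2, 3]            # expert 0 gets rows 0..1, expert 1 gets row 2
expert_scales = [1.0, 0.5]

c = grouped_matmul_1d1d_reference(a, b, a_offsets, expert_scales)
```

Because A and C share the same token ordering, the offsets that select an expert's input rows also select its output rows, so no per-group padding or separate output indexing is needed in this sketch.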

This is a port of max/kernels/src/linalg/grouped_matmul_sm100_1d1d.mojo to the structured kernels architecture.

See PORTING_PLAN.md for implementation details.

Modules

grouped_1d1d_matmul: CPU entrypoint for grouped 1D-1D block-scaled SM100 matmul.
grouped_1d1d_matmul_kernel: Grouped 1D-1D block-scaled SM100 matmul kernel.
grouped_1d1d_smem: Shared memory layout for grouped 1D-1D block-scaled SM100 matmul.
grouped_1d1d_tile_scheduler: Work scheduler for grouped 1D-1D block-scaled SM100 matmul.
