Mojo package

grouped_block_scaled_1d1d

Grouped block-scaled matmul with 1D-1D tensor layout for SM100.

This module provides a structured kernel implementation for grouped GEMM operations in Mixture of Experts (MoE) layers, using contiguous token buffers with offset-based addressing (the "1D-1D" layout).

Key characteristics:

A tensor: Contiguous (total_tokens, K) with a_offsets for per-group access
B tensor: Batched (num_experts, N, K) weights
C tensor: Contiguous (total_tokens, N) output
Per-expert output scaling via expert_scales tensor
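
The layout above can be illustrated with a small reference sketch. This is not the SM100 kernel and not the package's API: it is a plain Python loop that only mirrors the shapes listed above. The function name grouped_matmul_1d1d is hypothetical, and the sketch assumes that a_offsets holds num_experts + 1 prefix-sum entries, so that expert e owns rows a_offsets[e] up to (but not including) a_offsets[e + 1]. Block scaling of the A and B operands is omitted; only the per-expert output scale is shown.

```python
from typing import List

Matrix = List[List[float]]


def grouped_matmul_1d1d(
    a: Matrix,                   # (total_tokens, K): rows for all experts, back to back
    b: List[Matrix],             # (num_experts, N, K): one weight matrix per expert
    a_offsets: List[int],        # ASSUMPTION: num_experts + 1 prefix-sum row offsets
    expert_scales: List[float],  # one output scale per expert
) -> Matrix:
    """Reference semantics only: C[rows of e] = scale[e] * A[rows of e] @ B[e]^T."""
    total_tokens = len(a)
    n = len(b[0])
    k = len(b[0][0])
    c = [[0.0] * n for _ in range(total_tokens)]

    for e in range(len(b)):
        # Rows of A (and C) that belong to expert e in the contiguous buffer.
        start, end = a_offsets[e], a_offsets[e + 1]
        for t in range(start, end):
            for j in range(n):
                acc = 0.0
                for kk in range(k):
                    # B is stored as (N, K), so this is a dot product of two K-vectors.
                    acc += a[t][kk] * b[e][j][kk]
                c[t][j] = expert_scales[e] * acc
    return c


# Three tokens, two experts: expert 0 owns rows 0-1, expert 1 owns row 2.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [
    [[1.0, 0.0], [0.0, 1.0]],   # expert 0: identity weights
    [[1.0, 1.0], [1.0, -1.0]],  # expert 1
]
a_offsets = [0, 2, 3]
expert_scales = [1.0, 0.5]
c = grouped_matmul_1d1d(a, b, a_offsets, expert_scales)
print(c)  # [[1.0, 2.0], [3.0, 4.0], [5.5, -0.5]]
```

Because A and C share the same row indexing, an expert with no assigned tokens simply has start == end and contributes no work, which is the main appeal of offset-based addressing over padding each group to a fixed size.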

This is a port of max/kernels/src/linalg/grouped_matmul_sm100_1d1d.mojo to the structured kernels architecture.

See PORTING_PLAN.md for implementation details.

Modules

dispatch: Dispatch logic for grouped 1D-1D block-scaled SM100 matmul.
grouped_1d1d_matmul: CPU entrypoint for grouped 1D-1D block-scaled SM100 matmul.
grouped_1d1d_matmul_kernel: Grouped 1D-1D block-scaled SM100 matmul kernel.
grouped_1d1d_smem: Shared memory layout for grouped 1D-1D block-scaled SM100 matmul.
grouped_1d1d_tile_scheduler: Work scheduler for grouped 1D-1D block-scaled SM100 matmul.
grouped_matmul_block_scaled_swiglu: Unified NVFP4/MXFP8 grouped block-scaled matmul + SwiGLU dispatch.
grouped_matmul_swiglu_mxfp8: Unified dispatch for SwiGLU + MXFP8 grouped matmul.
grouped_matmul_swiglu_nvfp4: Unified dispatch for SwiGLU + NVFP4 grouped matmul.
