
Mojo module

matmul_kernels

SM100 Default Matmul Kernel - Standard FP8/BF16 warp-specialized kernel.

This module contains the default SM100 matmul kernel implementation:

B200MatmulSmem: Shared memory layout for the kernel
BlackwellMatmulSM100Kernel: Main kernel struct with run() and run_splitk()
BlackwellMatmulSM100FallbackKernel: Simple fallback kernel
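The list above names run_splitk() without describing it. Assuming it follows the conventional split-K GEMM scheme (an assumption this page does not confirm), the idea can be modeled in plain Python: the K dimension is cut into slices, each slice produces a partial product, and the partial products are summed. The function names below are hypothetical, and this is a conceptual model rather than the Mojo kernel.

```python
def matmul_reference(a, b):
    """Plain triple-loop matmul: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_splitk(a, b, num_splits):
    """Split-K model: each slice of K yields a partial sum that is accumulated.

    On a GPU the slices would run in parallel and be reduced afterwards;
    here they run one after another.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + num_splits - 1) // num_splits  # ceil(k / num_splits)
    out = [[0] * n for _ in range(m)]
    for s in range(num_splits):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        for i in range(m):
            for j in range(n):
                # Partial dot product over this K-slice only.
                out[i][j] += sum(a[i][p] * b[p][j] for p in range(lo, hi))
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
result = matmul_splitk(a, b, num_splits=2)
```

With integer inputs the split-K result matches the reference exactly. With floating-point inputs the different summation order can introduce small rounding differences.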

Shared components (WarpRole, KernelContext) are in kernel_common.mojo. The output pipeline (TileWriter, copy_accum_to_gmem) is in output_writer.mojo. Low-level epilogue components (TMAStoreExecutor, etc.) are in epilogue_components.mojo.

The kernel implements a warp-specialized architecture:

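Scheduler warp: Tile scheduling based on cluster launch control (CLC)
TMA Load warp: Asynchronous memory transfers through the Tensor Memory Accelerator (TMA)
MMA warp: Tensor core operations with accumulators held in tensor memory (TMEM)
Epilogue warps: Output from TMEM to global memory (GMEM) via TileWriter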
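The role split above can be pictured as a dispatch on the warp index: every thread computes which warp it belongs to and then runs only the code path for that warp's role. The following is a minimal conceptual sketch in Python, not the Mojo implementation. The enum here is a stand-in for the WarpRole in kernel_common.mojo, and the warp counts and their ordering are assumptions made for illustration.

```python
from enum import Enum

WARP_SIZE = 32           # threads per warp on NVIDIA GPUs
NUM_EPILOGUE_WARPS = 4   # assumed count, for illustration only


class WarpRole(Enum):
    """Stand-in for the kernel's warp roles (hypothetical layout)."""
    EPILOGUE = "epilogue"
    SCHEDULER = "scheduler"
    TMA_LOAD = "tma_load"
    MMA = "mma"


def role_for_warp(warp_id):
    """Map a warp index to its role under the assumed layout:
    warps 0..3 epilogue, 4 scheduler, 5 TMA load, 6 MMA."""
    if warp_id < NUM_EPILOGUE_WARPS:
        return WarpRole.EPILOGUE
    dedicated = [WarpRole.SCHEDULER, WarpRole.TMA_LOAD, WarpRole.MMA]
    offset = warp_id - NUM_EPILOGUE_WARPS
    if offset >= len(dedicated):
        raise ValueError(f"warp {warp_id} is outside the assumed layout")
    return dedicated[offset]


def role_for_thread(thread_idx):
    """Each thread derives its warp index, then its role."""
    return role_for_warp(thread_idx // WARP_SIZE)


# Roles of the first thread in each of the 7 assumed warps.
roles = [role_for_thread(w * WARP_SIZE) for w in range(7)]
```

Because whole warps take a single branch, the dispatch causes no intra-warp divergence, which is what makes this kind of specialization practical.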

Structs

B200MatmulSmem: Shared memory layout for B200 SM100 matrix multiplication kernel.
BlackwellMatmulSM100FallbackKernel: Simple fallback matmul kernel for SM100 (B200).
BlackwellMatmulSM100Kernel: Blackwell SM100 GEMM kernel with warp specialization.
