Mojo module

matmul_kernels

SM100 Default Matmul Kernel - Standard FP8/BF16 warp-specialized kernel.

This module contains the default SM100 matmul kernel implementation:

  • B200MatmulSmem: Shared memory layout for the kernel
  • BlackwellMatmulSM100Kernel: Main kernel struct with run() and run_splitk()
  • BlackwellMatmulSM100FallbackKernel: Simple fallback kernel

Shared components (WarpRole, KernelContext) are in kernel_common.mojo. Output pipeline (TileWriter, copy_accum_to_gmem) is in output_writer.mojo. Low-level epilogue components (TMAStoreExecutor, etc.) are in epilogue_components.mojo.

The kernel implements a warp-specialized architecture:

  • Scheduler warp: tile scheduling based on CLC (Cluster Launch Control)
  • TMA Load warp: Asynchronous GMEM-to-SMEM transfers via TMA
  • MMA warp: Tensor core operations with TMEM accumulators
  • Epilogue warps: Output from TMEM to GMEM via TileWriter
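As a rough mental model of how the four roles cooperate, the Python sketch below runs each role as a thread. Bounded queues stand in for the SMEM pipeline stages and the TMEM accumulator buffers. It is a toy illustration under those assumptions, not the Mojo kernel. The function name `warp_specialized_matmul`, its tile-size and buffer-count parameters, and the static tile order that replaces CLC scheduling are all hypothetical.

```python
import queue
import threading


def warp_specialized_matmul(A, B, BM=2, BN=2, BK=2, num_stages=2, num_accum=2):
    """Toy model of a warp-specialized matmul pipeline (hypothetical, not the real kernel).

    Each "warp" is a Python thread. smem_q models a ring of num_stages shared-memory
    stages, and accum_q models num_accum accumulator buffers handed to the epilogue.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    tile_q = queue.Queue()                     # scheduler -> load warp
    smem_q = queue.Queue(maxsize=num_stages)   # load warp -> MMA warp
    accum_q = queue.Queue(maxsize=num_accum)   # MMA warp -> epilogue

    def scheduler():
        # Stand-in for CLC-based scheduling: hand out output tiles in a fixed order.
        for m0 in range(0, M, BM):
            for n0 in range(0, N, BN):
                tile_q.put((m0, n0))
        tile_q.put(None)  # sentinel: no more tiles

    def load_warp():
        # Stand-in for TMA loads: copy one K-block of A and B into a pipeline stage.
        # put() blocks when all num_stages slots are full.
        while (tile := tile_q.get()) is not None:
            m0, n0 = tile
            for k0 in range(0, K, BK):
                a = [row[k0:k0 + BK] for row in A[m0:m0 + BM]]
                b = [row[n0:n0 + BN] for row in B[k0:k0 + BK]]
                smem_q.put((tile, a, b, k0 + BK >= K))
        smem_q.put(None)

    def mma_warp():
        # Stand-in for tensor-core MMAs: accumulate K-blocks into one output tile.
        acc = None
        while (item := smem_q.get()) is not None:
            tile, a, b, last = item
            if acc is None:
                acc = [[0.0] * len(b[0]) for _ in range(len(a))]
            for i in range(len(a)):
                for j in range(len(b[0])):
                    acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))
            if last:
                accum_q.put((tile, acc))  # finished tile goes to the epilogue
                acc = None
        accum_q.put(None)

    def epilogue():
        # Stand-in for the epilogue warps: write finished tiles to the output matrix.
        while (item := accum_q.get()) is not None:
            (m0, n0), acc = item
            for i, row in enumerate(acc):
                C[m0 + i][n0:n0 + len(row)] = row

    threads = [threading.Thread(target=role)
               for role in (scheduler, load_warp, mma_warp, epilogue)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 3]]
print(warp_specialized_matmul(A, B))  # → [[7.0, 11.0], [16.0, 23.0], [25.0, 35.0]]
```

In this sketch, the bounded queues are what make the roles overlap. The load warp can run up to `num_stages` K-blocks ahead of the MMA warp. The MMA warp can begin the next tile while the epilogue is still draining an earlier accumulator, provided one of the `num_accum` slots is free.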

Structs