Mojo module
matmul_kernels
SM100 Default Matmul Kernel - Standard FP8/BF16 warp-specialized kernel.
This module contains the default SM100 matmul kernel implementation:
- B200MatmulSmem: Shared memory layout for the kernel
- BlackwellMatmulSM100Kernel: Main kernel struct with run() and run_splitk()
- BlackwellMatmulSM100FallbackKernel: Simple fallback kernel
Related components live in other files:
- Shared components (WarpRole, KernelContext): kernel_common.mojo
- Output pipeline (TileWriter, copy_accum_to_gmem): output_writer.mojo
- Low-level epilogue components (TMAStoreExecutor, etc.): epilogue_components.mojo
The kernel implements a warp-specialized architecture:
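- Scheduler warp: CLC-based (Cluster Launch Control) tile scheduling
- TMA Load warp: Asynchronous memory transfers via the Tensor Memory Accelerator (TMA)
- MMA warp: Tensor core operations with tensor memory (TMEM) accumulators
- Epilogue warps: Write output from TMEM to global memory (GMEM) via TileWriter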
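The role split above can be illustrated with a small host-side model. This is only a sketch in Python, not the Mojo kernel code. The `WarpRole` enum here is an illustrative stand-in that borrows the name from kernel_common.mojo. The warp ordering and the epilogue warp count (`NUM_EPILOGUE_WARPS = 4`) are assumptions made for the example, not values taken from the kernel.

```python
from enum import Enum


class WarpRole(Enum):
    """Illustrative stand-in for the roles listed above (not the Mojo type)."""

    SCHEDULER = "scheduler"  # hands out output tiles
    TMA_LOAD = "tma_load"    # issues asynchronous copies into shared memory
    MMA = "mma"              # runs tensor core ops, accumulating in TMEM
    EPILOGUE = "epilogue"    # moves accumulators from TMEM to GMEM


# Hypothetical layout: one warp each for scheduling, loading, and MMA,
# followed by the epilogue warps. The real kernel may order these differently.
NUM_EPILOGUE_WARPS = 4
NUM_WARPS = 3 + NUM_EPILOGUE_WARPS


def warp_role(warp_id: int) -> WarpRole:
    """Map a warp index within the thread block to its (assumed) role."""
    if not 0 <= warp_id < NUM_WARPS:
        raise ValueError(f"warp_id {warp_id} out of range")
    if warp_id == 0:
        return WarpRole.SCHEDULER
    if warp_id == 1:
        return WarpRole.TMA_LOAD
    if warp_id == 2:
        return WarpRole.MMA
    return WarpRole.EPILOGUE


def roles_for_block() -> list[WarpRole]:
    """Return the role of every warp in one thread block, in warp order."""
    return [warp_role(w) for w in range(NUM_WARPS)]


if __name__ == "__main__":
    for warp_id, role in enumerate(roles_for_block()):
        print(warp_id, role.value)
```

In a warp-specialized kernel each warp branches once on its role and then stays in that role's loop, so the stages overlap instead of every warp doing a share of every stage.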
comptime values
UnsafePointer
comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]
Structs
- B200MatmulSmem: Shared memory layout for B200 SM100 matrix multiplication kernel.
- BlackwellMatmulSM100FallbackKernel: Simple fallback matmul kernel for SM100 (B200).
- BlackwellMatmulSM100Kernel: Blackwell SM100 GEMM kernel with warp specialization.