Mojo module
matmul_kernels
SM100 Matmul Kernel Structs - GPU kernel entry points and helpers.
This module contains the GPU kernel structs for SM100 matmul:
- WarpRole: Warp specialization roles (MMA, Load, Scheduler, Epilogue)
- KernelContext: Common kernel state (election vars, CTA coords, masks)
- B200MatmulSmem: Shared memory layout for the kernel
- BlackwellMatmulSM100Kernel: Main kernel struct with run() and run_splitk()
- BlackwellMatmulSM100FallbackKernel: Simple fallback kernel
- consumer_main_loop: MMA consumer loop (for external callers)
Output pipeline functions (copy_accum_to_gmem, multi_stage_store_C) are in matmul_output.mojo.
The kernel implements a warp-specialized architecture:
- Scheduler warp: CLC-based tile scheduling
- TMA Load warp: Async memory transfers
- MMA warp: Tensor core operations with TMEM accumulators
- Epilogue warps: Output from TMEM to GMEM (see matmul_output.mojo)
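The division of labor above can be sketched as a role dispatch at the top of the kernel body. This is pseudo-Mojo for illustration only: the `WarpRole` variants, `consumer_main_loop`, and `copy_accum_to_gmem` come from this module's listing, but the role-derivation step and the `schedule_next_tile` / `tma_load_ab_tiles` helpers are hypothetical placeholders, not this module's API:

```mojo
# Hypothetical skeleton of a warp-specialized kernel body (illustrative only).
var role = ...  # derived from the warp id; the exact mapping is kernel-specific

if role == WarpRole.Scheduler:
    schedule_next_tile()     # CLC-based tile scheduling (placeholder name)
elif role == WarpRole.Load:
    tma_load_ab_tiles()      # async TMA copies into shared memory (placeholder name)
elif role == WarpRole.MMA:
    consumer_main_loop(...)  # tensor-core MMA into TMEM accumulators
else:  # WarpRole.Epilogue
    copy_accum_to_gmem(...)  # TMEM -> GMEM output; see matmul_output.mojo
```

Each warp group runs only its own branch, so loads, MMAs, and stores overlap across warps instead of serializing within one.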
comptime values
RLayout32Bits
`comptime RLayout32Bits[layout: Layout] = RuntimeLayout[layout, element_type=DType.uint32, linear_idx_type=DType.uint32]`
Parameters
- layout (`Layout`):
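A minimal usage sketch of this alias, assuming the wrapped `Layout` is fully static so `RuntimeLayout`'s default constructor applies (the tile shape and import path below are illustrative assumptions, not taken from the kernel):

```mojo
from layout import Layout, RuntimeLayout  # import path may differ by version

alias tile = Layout.row_major(64, 128)  # illustrative tile shape

# RLayout32Bits[tile] is a RuntimeLayout whose element and linear-index types
# are both uint32, keeping index arithmetic in 32-bit GPU registers.
var rt = RLayout32Bits[tile]()
```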
Structs
- `B200MatmulSmem`: Shared memory layout for B200 SM100 matrix multiplication kernel.
- `BlackwellMatmulSM100FallbackKernel`: Simple fallback matmul kernel for SM100 (B200).
- `BlackwellMatmulSM100Kernel`: Blackwell SM100 GEMM kernel with warp specialization.
- `KernelContext`: Shared kernel state: election vars, CTA coords, multicast masks, pipeline states.
- `WarpRole`: Warp role identifiers for SM100 warp-specialized kernel.
Functions
- `consumer_main_loop`: Consume tiles from shared memory and execute MMA operations.
- `f32_frag_to_smem`:
- `stsm_helper`: Store a fragment to shared memory using st.matrix.