Mojo module
block_scaled_matmul_kernels
Block-scaled SM100 matmul kernel for MXFP8 matrix multiplication.
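As a rough illustration of what "block-scaled" means numerically, the sketch below is a plain Python reference model, not the kernel's code. It assumes the OCP MX convention of 32 consecutive K elements sharing one E8M0 (power-of-two) scale factor. The names `BLOCK_K`, `e8m0_to_float`, and `block_scaled_matmul` are hypothetical, and the operand layouts are chosen for readability rather than taken from the module.

```python
# Reference model of a block-scaled matmul (illustrative only).
# Assumption: one scale per 32 K-elements, stored as an E8M0 exponent byte.

BLOCK_K = 32  # elements sharing one scale factor (assumed, per the OCP MX format)


def e8m0_to_float(exp_bits: int) -> float:
    """Decode an E8M0 scale: a biased 8-bit exponent, value 2**(exp_bits - 127).

    The encoding 255 (NaN in the MX format) is not handled in this sketch.
    """
    return 2.0 ** (exp_bits - 127)


def block_scaled_matmul(a, sfa, b, sfb):
    """Compute D[m][n] = sum over k of (A[m][k] * SFA[m][k // 32]) * (B[n][k] * SFB[n][k // 32]).

    a:   M x K element values, already decoded to float
    sfa: M x (K / 32) E8M0 exponent bytes for A
    b:   N x K element values (K-major, i.e. B stored transposed)
    sfb: N x (K / 32) E8M0 exponent bytes for B
    """
    m_dim, n_dim, k_dim = len(a), len(b), len(a[0])
    assert k_dim % BLOCK_K == 0, "K must be a multiple of the scaling block size"
    d = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for kb in range(k_dim // BLOCK_K):
                lo = kb * BLOCK_K
                # Unscaled partial dot product over one 32-element block.
                partial = sum(a[m][k] * b[n][k] for k in range(lo, lo + BLOCK_K))
                # Apply both block scales once per block.
                acc += partial * e8m0_to_float(sfa[m][kb]) * e8m0_to_float(sfb[n][kb])
            d[m][n] = acc
    return d
```

Because the scales are constant within a block, they can be applied once per block rather than once per element, which is what makes it practical to fold scaling into the matrix-multiply step.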
Warp-specialized architecture:
- Scheduler: CLC-based (cluster launch control) tile distribution
- TMA Load: Async loads for A, B, and their scaling factors (SFA, SFB)
- MMA: Block-scaled tensor core ops with TMEM accumulators
- Epilogue: TMEM → SMEM → GMEM output pipeline
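The division of labor listed above can be pictured as a producer/consumer pipeline. The following is only a CPU analogy in Python, with threads standing in for specialized warps and bounded queues standing in for the staging buffers between them. It does not model TMA, TMEM, barriers, or the CLC scheduler (the input list plays the scheduler's part), and `NUM_STAGES`, `run_pipeline`, and the role function names are hypothetical.

```python
import queue
import threading

NUM_STAGES = 2   # hypothetical pipeline depth (buffers in flight between roles)
DONE = object()  # sentinel marking the end of the tile stream


def run_pipeline(tiles):
    """Push (tile_id, a, b) work items through load -> mma -> epilogue roles."""
    loaded = queue.Queue(maxsize=NUM_STAGES)       # load -> mma hand-off
    accumulated = queue.Queue(maxsize=NUM_STAGES)  # mma -> epilogue hand-off
    output = []

    def load_role():
        # Stands in for the load role: fetch operands for each scheduled tile.
        for tile_id, a, b in tiles:
            loaded.put((tile_id, a, b))
        loaded.put(DONE)

    def mma_role():
        # Stands in for the MMA role: consume operands, produce an accumulator.
        while (item := loaded.get()) is not DONE:
            tile_id, a, b = item
            accumulated.put((tile_id, sum(x * y for x, y in zip(a, b))))
        accumulated.put(DONE)

    def epilogue_role():
        # Stands in for the epilogue role: drain accumulators to the output.
        while (item := accumulated.get()) is not DONE:
            output.append(item)

    threads = [threading.Thread(target=role)
               for role in (load_role, mma_role, epilogue_role)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output
```

The bounded queues let each role run ahead of the next by at most `NUM_STAGES` items, which is the same idea as overlapping loads for one tile with compute on the previous one.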
comptime values
UnsafePointer
comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]
Structs
- BlackwellBlockScaledMatmulKernel: SM100 block-scaled GEMM kernel for MXFP8 (FP8 with microscaling).
- BlockScaledKernelContext: Per-CTA state: election flags, coordinates, multicast masks, TMEM offsets.