Mojo module
block_scaled_matmul_kernel
Block-scaled SM100 matmul kernel - Structured kernel using tile pipelines.
Uses patterns from matmul_kernels.mojo with typed SMEM accessors and context manager-based pipeline synchronization for MXFP8 and NVFP4 block-scaled matrix multiplication.
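As a rough illustration of what "block-scaled" means numerically, the following pure-Python sketch computes a matmul in which each contiguous block of K elements shares one scale factor per operand row. It is a reference model only, not the kernel's code: the function names, the `block` parameter, and the layout of the scale factors (one scale per row per K-block, with B stored K-major) are assumptions made for this example, and the element types are plain Python floats rather than MXFP8 or NVFP4 values.

```python
def block_scaled_dot(a, b, sfa, sfb, block):
    """Dot product of two vectors stored as (elements, per-block scales).

    Hypothetical helper: `block` consecutive elements along K share one
    scale factor in `sfa` (for `a`) and one in `sfb` (for `b`).
    """
    assert len(a) == len(b) and len(a) % block == 0
    assert len(sfa) == len(sfb) == len(a) // block
    total = 0.0
    for blk in range(len(a) // block):
        lo = blk * block
        # Unscaled partial dot product over one K-block.
        partial = sum(x * y for x, y in zip(a[lo:lo + block], b[lo:lo + block]))
        # Apply both operands' scale factors once per block.
        total += sfa[blk] * sfb[blk] * partial
    return total


def block_scaled_matmul(A, B, SFA, SFB, block):
    """C[m][n] = block-scaled dot of row m of A with row n of B (B is K-major)."""
    return [
        [block_scaled_dot(A[m], B[n], SFA[m], SFB[n], block) for n in range(len(B))]
        for m in range(len(A))
    ]
```

For example, with `a = [1, 2, 3, 4]`, `b = [1, 1, 1, 1]`, `sfa = [2.0, 0.5]`, `sfb = [1.0, 4.0]` and `block = 2`, the two blocks contribute `2.0 * 1.0 * 3` and `0.5 * 4.0 * 7`, so `block_scaled_dot` returns `20.0`.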
Architecture:
- Uses Self.SmemType (BlockScaledSmem) with typed tile/barrier accessors
- Uses Self.InputTilePipeline (BlockScaledTilePipeline) for producer/consumer sync
- Load warp: with input_pipeline.producer() as stage -> Self.load_input_tiles()
- MMA warp: with input_pipeline.consumer() as stage -> Self.mma()
- Epilogue warp: Uses structured building blocks from tile_writer.mojo
Epilogue Building Blocks (from tile_writer.mojo):
- TmemArrayType / load_fragments() for TMEM load
- AccumBarrier.arrive() for barrier signaling
- TMEMToSMemWriter.write_fragments() for SMEM write
- tma_wait_pipelined() for TMA wait
- TMA store remains inline (3D batch coordinates)
Key structured patterns:
- Context manager pattern for pipeline synchronization
- ProducerStage/ConsumerStage encapsulate tiles and barriers
- stage.get_tiles(j) returns (a, b, sfa, sfb) tuple
- Automatic wait/step in context manager enter/exit
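The context-manager pattern listed above can be modeled on the host in a few lines of Python. This is a single-threaded toy, not the Mojo kernel: `TilePipeline`, `Stage`, `put_tiles`, and the boolean `full` flag are hypothetical stand-ins for the pipeline, its stages, and its barriers, and the "wait" is modeled as an error rather than a blocking barrier wait. What it aims to show is the shape of the API: entering the `with` block performs the wait, leaving it signals the other side and steps to the next stage, and `get_tiles(j)` returns an `(a, b, sfa, sfb)` tuple.

```python
from contextlib import contextmanager


class Stage:
    """One pipeline slot holding k_group (a, b, sfa, sfb) tile tuples."""

    def __init__(self, k_group):
        self.tiles = [None] * k_group
        self.full = False  # stand-in for the full/empty barrier pair

    def put_tiles(self, j, a, b, sfa, sfb):  # hypothetical producer-side helper
        self.tiles[j] = (a, b, sfa, sfb)

    def get_tiles(self, j):
        return self.tiles[j]


class TilePipeline:
    """Toy ring of stages shared by one producer and one consumer."""

    def __init__(self, num_stages, k_group=1):
        self.stages = [Stage(k_group) for _ in range(num_stages)]
        self.produce_idx = 0
        self.consume_idx = 0

    @contextmanager
    def producer(self):
        stage = self.stages[self.produce_idx]
        # Enter: "wait" until the slot is free (modeled as an error here).
        if stage.full:
            raise RuntimeError("producer would block: slot not consumed yet")
        yield stage
        # Exit: mark the slot as full and step to the next stage.
        stage.full = True
        self.produce_idx = (self.produce_idx + 1) % len(self.stages)

    @contextmanager
    def consumer(self):
        stage = self.stages[self.consume_idx]
        # Enter: "wait" until the slot has been filled.
        if not stage.full:
            raise RuntimeError("consumer would block: slot not produced yet")
        yield stage
        # Exit: release the slot and step to the next stage.
        stage.full = False
        self.consume_idx = (self.consume_idx + 1) % len(self.stages)


def run_demo():
    pipe = TilePipeline(num_stages=2)
    consumed = []
    for k in range(4):
        with pipe.producer() as stage:  # plays the role of the load warp
            stage.put_tiles(0, f"a{k}", f"b{k}", f"sfa{k}", f"sfb{k}")
        with pipe.consumer() as stage:  # plays the role of the MMA warp
            consumed.append(stage.get_tiles(0))
    return consumed
```

Because the wait and the step live in the context manager, the loop body contains only the work on the tiles, which is the main appeal of the pattern. In the real kernel the producer and consumer run concurrently in different warps; the toy interleaves them in one loop purely to stay runnable.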
comptime values
UnsafePointer
comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]
Structs
- BlackwellBlockScaledMatmulKernel: Block-scaled matmul kernel V3 - ported from working legacy kernel.