Mojo module

block_scaled_matmul_kernel

Block-scaled SM100 matmul kernel - Structured kernel using tile pipelines.

Follows the patterns in matmul_kernels.mojo, using typed SMEM accessors and context-manager-based pipeline synchronization to implement MXFP8 and NVFP4 block-scaled matrix multiplication.
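For readers new to block scaling, the arithmetic can be sketched in plain Python. This is a conceptual reference only, not the kernel's implementation: the function name `block_scaled_matmul`, the block size `BLOCK = 4`, and the N x K layout of `b` are assumptions chosen for illustration. Each run of `block` consecutive elements along K shares one scale factor (`sfa` for A, `sfb` for B), and each block's partial dot product is multiplied by both scales before accumulation.

```python
# Conceptual reference for block-scaled matmul; NOT the kernel's implementation.
# BLOCK is an illustrative block size along K (an assumption for this sketch).
BLOCK = 4


def block_scaled_matmul(a, sfa, b, sfb, block=BLOCK):
    """Return C where C[m][n] = sum over K-blocks of
    (dot(a[m][blk], b[n][blk]) * sfa[m][kb] * sfb[n][kb]).

    a:   M x K values,  sfa: M x (K // block) scale factors
    b:   N x K values,  sfb: N x (K // block) scale factors
    """
    m_dim, n_dim, k_dim = len(a), len(b), len(a[0])
    assert k_dim % block == 0, "K must be a multiple of the block size"
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for kb in range(k_dim // block):
                lo, hi = kb * block, (kb + 1) * block
                # Unscaled partial dot product over one K-block.
                partial = sum(a[m][k] * b[n][k] for k in range(lo, hi))
                # Apply the per-block scale factors of both operands.
                acc += partial * sfa[m][kb] * sfb[n][kb]
            c[m][n] = acc
    return c
```

With `a = [[1, 2, 3, 4, 1, 1, 1, 1]]`, `sfa = [[2.0, 0.5]]`, `b = [[1, 1, 1, 1, 2, 2, 2, 2]]`, and `sfb = [[1.0, 4.0]]`, the first block contributes 10 * 2.0 * 1.0 = 20 and the second 8 * 0.5 * 4.0 = 16, giving `[[36.0]]`.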

Architecture:

  • Uses Self.SmemType (BlockScaledSmem) with typed tile/barrier accessors
  • Uses Self.InputTilePipeline (BlockScaledTilePipeline) for producer/consumer sync
  • Load warp: with input_pipeline.producer() as stage -> Self.load_input_tiles()
  • MMA warp: with input_pipeline.consumer() as stage -> Self.mma()
  • Epilogue warp: Uses structured building blocks from tile_writer.mojo

Epilogue Building Blocks (from tile_writer.mojo):

  • TmemArrayType / load_fragments() for TMEM load
  • AccumBarrier.arrive() for barrier signaling
  • TMEMToSMemWriter.write_fragments() for SMEM write
  • tma_wait_pipelined() for TMA wait
  • TMA store remains inline (3D batch coordinates)

Key structured patterns:

  • Context manager pattern for pipeline synchronization
  • ProducerStage/ConsumerStage encapsulate tiles and barriers
  • stage.get_tiles(j) returns (a, b, sfa, sfb) tuple
  • Automatic wait/step in context manager enter/exit
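The context manager pattern listed above can be modeled in Python. This is a toy analogy under stated assumptions, not the Mojo API: `TilePipeline`, its `slots` ring buffer, and the `run` helper are hypothetical names, CPU threads stand in for the load and MMA warps, and semaphores stand in for the hardware barriers. What it shows is the shape of the pattern: entering the `with` block waits for a stage, and leaving it signals the other side and steps to the next stage.

```python
import threading
from contextlib import contextmanager


class TilePipeline:
    """Toy ring buffer of stages guarded by semaphores (hypothetical model)."""

    def __init__(self, num_stages):
        self.slots = [None] * num_stages
        self.empty = threading.Semaphore(num_stages)  # counts free stages
        self.full = threading.Semaphore(0)            # counts filled stages
        self.head = 0  # next stage to fill (touched only by the producer)
        self.tail = 0  # next stage to drain (touched only by the consumer)

    @contextmanager
    def producer(self):
        self.empty.acquire()                  # enter: wait for a free stage
        stage = self.head
        yield stage
        self.head = (stage + 1) % len(self.slots)  # exit: step
        self.full.release()                   # exit: signal "stage is full"

    @contextmanager
    def consumer(self):
        self.full.acquire()                   # enter: wait for a filled stage
        stage = self.tail
        yield stage
        self.tail = (stage + 1) % len(self.slots)  # exit: step
        self.empty.release()                  # exit: signal "stage is free"


def run(num_tiles=5, num_stages=2):
    pipe = TilePipeline(num_stages)
    results = []

    def load_warp():
        for i in range(num_tiles):
            with pipe.producer() as stage:
                pipe.slots[stage] = (i, i * 10)  # stand-in for loading tiles

    def mma_warp():
        for _ in range(num_tiles):
            with pipe.consumer() as stage:
                a, b = pipe.slots[stage]         # stand-in for reading tiles
                results.append(a + b)            # stand-in for the MMA

    threads = [threading.Thread(target=load_warp),
               threading.Thread(target=mma_warp)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the wait and signal calls live inside the context managers, the loop bodies in `load_warp` and `mma_warp` contain no explicit synchronization, which is the property the structured kernel aims for.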

comptime values

UnsafePointer

comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]

Structs