
Mojo module

block_scaled_matmul_kernel

Block-scaled SM100 matmul kernel - Structured kernel using tile pipelines.

Implements MXFP8 and NVFP4 block-scaled matrix multiplication, reusing the patterns from matmul_kernels.mojo: typed shared-memory (SMEM) accessors and context manager-based pipeline synchronization.
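
As a point of reference for what "block-scaled" means numerically, the following is a minimal pure-Python sketch, not the kernel's implementation. It assumes A is stored as M x K and B as N x K, with one scale factor per row per block of `block_size` consecutive K elements. The function name, the layout, and the `block_size` parameter are illustrative assumptions; the names `sfa` and `sfb` follow the tile names used on this page.

```python
def block_scaled_matmul(a, b, sfa, sfb, block_size):
    """Reference block-scaled matmul on plain Python lists (illustrative only).

    a:   M x K values            sfa: M x (K / block_size) scale factors
    b:   N x K values            sfb: N x (K / block_size) scale factors
    Returns C (M x N) where each K-block's partial dot product is
    multiplied by the matching A and B scale factors before accumulation.
    """
    m_dim, n_dim, k_dim = len(a), len(b), len(a[0])
    assert k_dim % block_size == 0, "K must be a multiple of block_size"
    num_blocks = k_dim // block_size

    c = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for kb in range(num_blocks):
                lo = kb * block_size
                # Unscaled dot product over one block of K.
                partial = sum(a[m][k] * b[n][k] for k in range(lo, lo + block_size))
                # Apply the per-block scale factors for this row of A and B.
                acc += sfa[m][kb] * sfb[n][kb] * partial
            c[m][n] = acc
    return c
```

The hardware path applies the scales inside the MMA instruction rather than in a separate loop; the sketch only fixes the arithmetic that the result should match.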

Architecture:

  • Uses Self.SmemType (BlockScaledSmem) with typed tile/barrier accessors
  • Uses Self.InputTilePipeline (BlockScaledTilePipeline) for producer/consumer sync
  • Load warp: enters "with input_pipeline.producer() as stage", then calls Self.load_input_tiles()
  • MMA warp: enters "with input_pipeline.consumer() as stage", then calls Self.mma()
  • Epilogue warp: uses structured building blocks from epilogue_components.mojo

Epilogue Building Blocks (from epilogue_components.mojo):

  • TmemArrayType / load_fragments() for TMEM load
  • AccumBarrier.arrive() for barrier signaling
  • TMEMToSMemWriter.write_fragments() for SMEM write
  • tma_wait_pipelined() for TMA wait
  • TMA store remains inline (3D batch coordinates)

Key structured patterns:

  • Context manager pattern for pipeline synchronization
  • ProducerStage/ConsumerStage encapsulate tiles and barriers
  • stage.get_tiles(j) returns (a, b, sfa, sfb) tuple
  • The context manager waits automatically on enter and steps the pipeline on exit
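
The context-manager pattern listed above can be modelled in a few lines of Python, whose `with` syntax closely resembles Mojo's. This is an analogy under stated assumptions, not the module's API: `TilePipeline`, `Stage`, `set_tiles`, and `run` are hypothetical names, and semaphores stand in for the GPU barriers that the real ProducerStage/ConsumerStage encapsulate. Only `producer()`, `consumer()`, and `get_tiles(j)` correspond to names on this page.

```python
import threading
from contextlib import contextmanager


class Stage:
    """Hypothetical stand-in for ProducerStage/ConsumerStage: one ring-buffer slot."""

    def __init__(self, slot):
        self._slot = slot

    def get_tiles(self, j):
        # Returns the (a, b, sfa, sfb) tuple stored at position j of this stage.
        return self._slot[j]

    def set_tiles(self, j, tiles):
        # Hypothetical helper; the real load warp fills tiles via TMA.
        self._slot[j] = tiles


class TilePipeline:
    """Ring buffer of stages; semaphores model the full/empty barriers."""

    def __init__(self, num_stages, group_size=1):
        self._slots = [[None] * group_size for _ in range(num_stages)]
        self._empty = threading.Semaphore(num_stages)  # free slots
        self._full = threading.Semaphore(0)            # filled slots
        self._prod_idx = 0
        self._cons_idx = 0

    @contextmanager
    def producer(self):
        self._empty.acquire()                  # enter: wait for a free slot
        yield Stage(self._slots[self._prod_idx])
        self._prod_idx = (self._prod_idx + 1) % len(self._slots)  # exit: step
        self._full.release()                   # exit: signal the consumer

    @contextmanager
    def consumer(self):
        self._full.acquire()                   # enter: wait for a filled slot
        yield Stage(self._slots[self._cons_idx])
        self._cons_idx = (self._cons_idx + 1) % len(self._slots)  # exit: step
        self._empty.release()                  # exit: hand the slot back


def run(num_k_tiles, num_stages=2):
    """Run a load thread and an MMA thread over num_k_tiles iterations."""
    pipe = TilePipeline(num_stages)
    consumed = []

    def load_warp():
        for k in range(num_k_tiles):
            with pipe.producer() as stage:
                stage.set_tiles(0, (f"a{k}", f"b{k}", f"sfa{k}", f"sfb{k}"))

    def mma_warp():
        for _ in range(num_k_tiles):
            with pipe.consumer() as stage:
                a, b, sfa, sfb = stage.get_tiles(0)
                consumed.append((a, b, sfa, sfb))

    threads = [threading.Thread(target=load_warp), threading.Thread(target=mma_warp)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

Because the wait happens on enter and the step and signal happen on exit, neither loop body contains explicit synchronization, which is the property the structured kernel aims for. The producer can run at most `num_stages` tiles ahead of the consumer.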

Structs