
Mojo module

block_scaled_matmul_kernel

Block-scaled matmul kernel for SM100: a structured kernel built on tile pipelines.

The kernel follows the patterns in matmul_kernels.mojo: typed shared-memory (SMEM) accessors and context-manager-based pipeline synchronization, applied to MXFP8 and NVFP4 block-scaled matrix multiplication.
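As a rough illustration of what "block-scaled" means numerically, the following pure-Python reference computes a matmul in which every run of `block_size` elements along K shares one scale factor per row of A and per row of B. It is a sketch under assumptions, not the kernel's implementation: the function name, the N x K storage of B, and the list-of-lists layout are all hypothetical, and `block_size` is left as a parameter because this page does not state it.

```python
def block_scaled_matmul(a, b, sfa, sfb, block_size):
    """Reference C = A @ B^T with one shared scale per K-block.

    a:   M x K values; b: N x K values (assumed K-major, hypothetical layout)
    sfa: M x (K // block_size) scales for A
    sfb: N x (K // block_size) scales for B
    """
    m_dim, n_dim, k_dim = len(a), len(b), len(a[0])
    assert k_dim % block_size == 0, "K must be a multiple of the block size"
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for kb in range(k_dim // block_size):
                k0 = kb * block_size
                # Unscaled dot product over one K-block.
                partial = sum(a[m][k] * b[n][k] for k in range(k0, k0 + block_size))
                # Both operands' block scales apply to the whole partial sum.
                acc += sfa[m][kb] * sfb[n][kb] * partial
            c[m][n] = acc
    return c
```

Because the scales are constant within a block, they factor out of the inner dot product, which is why the scale tiles (sfa, sfb) can travel alongside the a and b tiles rather than being applied element by element.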

Architecture:

Uses Self.SmemType (BlockScaledSmem) with typed tile/barrier accessors
Uses Self.InputTilePipeline (BlockScaledTilePipeline) for producer/consumer sync
Load warp: with input_pipeline.producer() as stage -> Self.load_input_tiles()
MMA warp: with input_pipeline.consumer() as stage -> Self.mma()
Epilogue warp: Uses structured building blocks from epilogue_components.mojo

Epilogue Building Blocks (from epilogue_components.mojo):

TmemArrayType / load_fragments() for TMEM load
AccumBarrier.arrive() for barrier signaling
TMEMToSMemWriter.write_fragments() for SMEM write
tma_wait_pipelined() for TMA wait
TMA store remains inline (3D batch coordinates)

Key structured patterns:

Context manager pattern for pipeline synchronization
ProducerStage/ConsumerStage encapsulate tiles and barriers
stage.get_tiles(j) returns (a, b, sfa, sfb) tuple
Automatic wait/step in context manager enter/exit
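The context-manager pattern described above can be modeled on the host in plain Python. This is only a sketch of the idea (wait on enter, signal and step on exit), not the Mojo kernel's API: `TilePipeline`, `Stage`, `store`, `load`, and `run` are hypothetical names, threads stand in for warps, and semaphores stand in for the hardware barriers.

```python
import threading
from contextlib import contextmanager


class Stage:
    """Handle to one slot of the ring; stands in for ProducerStage/ConsumerStage."""

    def __init__(self, slots, idx):
        self._slots, self._idx = slots, idx

    def store(self, tiles):
        self._slots[self._idx] = tiles

    def load(self):
        return self._slots[self._idx]


class TilePipeline:
    """Toy ring of `num_stages` slots guarded by empty/full semaphores."""

    def __init__(self, num_stages):
        self.slots = [None] * num_stages
        self.empty = [threading.Semaphore(1) for _ in range(num_stages)]
        self.full = [threading.Semaphore(0) for _ in range(num_stages)]
        self.producer_idx = 0  # touched only by the producer thread
        self.consumer_idx = 0  # touched only by the consumer thread

    @contextmanager
    def producer(self):
        idx = self.producer_idx
        self.empty[idx].acquire()          # enter: wait until the slot is free
        yield Stage(self.slots, idx)
        self.full[idx].release()           # exit: signal that tiles are ready
        self.producer_idx = (idx + 1) % len(self.slots)  # exit: step

    @contextmanager
    def consumer(self):
        idx = self.consumer_idx
        self.full[idx].acquire()           # enter: wait until tiles are ready
        yield Stage(self.slots, idx)
        self.empty[idx].release()          # exit: hand the slot back
        self.consumer_idx = (idx + 1) % len(self.slots)  # exit: step


def run(num_tiles, num_stages=2):
    pipe = TilePipeline(num_stages)
    results = []

    def load_warp():
        for k in range(num_tiles):
            with pipe.producer() as stage:
                # Placeholder strings stand in for the (a, b, sfa, sfb) tiles.
                stage.store(("a%d" % k, "b%d" % k, "sfa%d" % k, "sfb%d" % k))

    def mma_warp():
        for _ in range(num_tiles):
            with pipe.consumer() as stage:
                results.append(stage.load())

    threads = [threading.Thread(target=load_warp), threading.Thread(target=mma_warp)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The point of the design is that neither loop body mentions a barrier: all waiting, signaling, and stage advancement live in the enter/exit halves of the context manager, so the load and MMA code cannot forget a step or perform it in the wrong order.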

Structs

BlackwellBlockScaledMatmulKernel: Block-scaled matmul kernel V3 - ported from working legacy kernel.
