Mojo module

block_scaled_matmul_kernel

Block-scaled SM100 matmul kernel - Structured kernel using tile pipelines.

Follows the patterns in matmul_kernels.mojo, combining typed SMEM accessors with context-manager-based pipeline synchronization, to implement MXFP8 and NVFP4 block-scaled matrix multiplication.
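As a rough illustration of what "block-scaled" means numerically, the following pure-Python reference applies one scale factor per block of K elements to each operand before accumulating. It is a sketch of the arithmetic only, not the kernel's code: the function name, the K-major layout of `b`, the block length of 4, and the use of plain floats in place of FP8/FP4 payloads are all assumptions made for the example.

```python
# Reference semantics only. Plain Python floats stand in for the
# low-precision payloads, and BLOCK is an illustrative block length.
BLOCK = 4  # hypothetical number of K elements sharing one scale factor


def block_scaled_matmul(a, sfa, b, sfb, block=BLOCK):
    """Compute C[i][j] = sum over K blocks of sfa * sfb * dot(a_block, b_block).

    a:   M x K values            sfa: M x (K / block) scale factors
    b:   N x K values (K-major)  sfb: N x (K / block) scale factors
    """
    m, n, k = len(a), len(b), len(a[0])
    assert k % block == 0, "K must be a multiple of the scale block length"
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kb in range(k // block):
                lo, hi = kb * block, (kb + 1) * block
                # Unscaled partial dot product over one K block.
                partial = sum(a[i][kk] * b[j][kk] for kk in range(lo, hi))
                # Both operands' scale factors apply to the whole block.
                acc += sfa[i][kb] * sfb[j][kb] * partial
            c[i][j] = acc
    return c


a = [[1, 2, 3, 4, 1, 1, 1, 1]]
sfa = [[2.0, 0.5]]
b = [[1, 1, 1, 1, 2, 2, 2, 2]]
sfb = [[1.0, 4.0]]
result = block_scaled_matmul(a, sfa, b, sfb)
```

Because each scale factor is constant within its block, it can be applied once per block to the partial sum rather than to every element, which is what makes the scheme cheap to fuse into the matrix multiply.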

Architecture:

Uses Self.SmemType (BlockScaledSmem) with typed tile/barrier accessors
Uses Self.InputTilePipeline (BlockScaledTilePipeline) for producer/consumer sync
Load warp: with input_pipeline.producer() as stage -> Self.load_input_tiles()
MMA warp: with input_pipeline.consumer() as stage -> Self.mma()
Epilogue warp: Uses structured building blocks from tile_writer.mojo

Epilogue Building Blocks (from tile_writer.mojo):

TmemArrayType / load_fragments() for TMEM load
AccumBarrier.arrive() for barrier signaling
TMEMToSMemWriter.write_fragments() for SMEM write
tma_wait_pipelined() for TMA wait
TMA store remains inline (3D batch coordinates)

Key structured patterns:

Context manager pattern for pipeline synchronization
ProducerStage/ConsumerStage encapsulate tiles and barriers
stage.get_tiles(j) returns (a, b, sfa, sfb) tuple
Automatic wait/step in context manager enter/exit
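The context-manager pattern listed above can be sketched on the host in Python. This is an analogy under stated assumptions, not the module's Mojo API: `TilePipeline`, `Stage`, `put_tiles`, and the semaphore-based synchronization are hypothetical stand-ins for the pipeline, ProducerStage/ConsumerStage, and the hardware barriers, and two Python threads play the roles of the load and MMA warps.

```python
import threading
from contextlib import contextmanager


class Stage:
    """Hypothetical view of one pipeline slot (tiles keyed by index j)."""

    def __init__(self, slot):
        self._slot = slot

    def put_tiles(self, j, tiles):
        self._slot[j] = tiles

    def get_tiles(self, j):
        # Returns an (a, b, sfa, sfb) tuple, mirroring the description above.
        return self._slot[j]


class TilePipeline:
    """Hypothetical ring of stages; semaphores stand in for barriers."""

    def __init__(self, num_stages):
        self._slots = [dict() for _ in range(num_stages)]
        self._empty = threading.Semaphore(num_stages)  # free slots
        self._full = threading.Semaphore(0)            # filled slots
        self._prod_idx = 0
        self._cons_idx = 0

    @contextmanager
    def producer(self):
        self._empty.acquire()                  # enter: wait for a free slot
        yield Stage(self._slots[self._prod_idx])
        self._prod_idx = (self._prod_idx + 1) % len(self._slots)  # exit: step
        self._full.release()                   # exit: signal the consumer

    @contextmanager
    def consumer(self):
        self._full.acquire()                   # enter: wait for a filled slot
        yield Stage(self._slots[self._cons_idx])
        self._cons_idx = (self._cons_idx + 1) % len(self._slots)  # exit: step
        self._empty.release()                  # exit: hand the slot back


def run(num_iters, num_stages):
    pipe = TilePipeline(num_stages)
    consumed = []

    def load_warp():
        for i in range(num_iters):
            with pipe.producer() as stage:
                stage.put_tiles(0, (f"a{i}", f"b{i}", f"sfa{i}", f"sfb{i}"))

    def mma_warp():
        for _ in range(num_iters):
            with pipe.consumer() as stage:
                consumed.append(stage.get_tiles(0))

    threads = [threading.Thread(target=load_warp),
               threading.Thread(target=mma_warp)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

The point of the design is that the loop bodies contain no explicit wait or signal calls: entering the `with` block performs the wait, and leaving it advances the stage index and signals the other side, so a missed step or arrive cannot be introduced by accident in the loop body.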

`comptime` values

`UnsafePointer`

comptime UnsafePointer = LegacyUnsafePointer[?, address_space=?, origin=?]

Structs

BlackwellBlockScaledMatmulKernel: Block-scaled matmul kernel V3 - ported from working legacy kernel.
