Mojo module
tile_writer
TileWriter components for SM100 matrix multiplication epilogue.
This module provides modular components for the output pipeline:
- TMAStoreWriter: TMA async store from shared memory to global memory
- StMatrixWriter: Register to shared memory via st.matrix instructions
- TMEMReader: Load accumulator data from tensor memory to registers
- EpilogueApplier: Apply element-wise operations on fragments
The SM100 epilogue pipeline flows as: TMEM (accumulators) → Registers → SMEM → GMEM (via TMA)
Usage:
    # TMA store from shared memory to global memory
    var tma_writer = TMAStoreWriter...
    tma_writer.store_tile(c_smem_tile, (n_coord, m_coord))
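The four-stage flow described above (TMEM → Registers → SMEM → GMEM) can be illustrated with a small conceptual model. This is a minimal sketch in plain Python, not Mojo and not the module's API: nested lists stand in for each memory space, every function name here is hypothetical, and treating `n_coord` as the column offset and `m_coord` as the row offset is an assumption based only on the argument order in the usage snippet.

```python
# Conceptual model only: plain Python lists stand in for TMEM, registers,
# SMEM, and GMEM. None of these function names belong to tile_writer.

def read_tmem(tmem, row_start, rows):
    """Stage 1: copy accumulator rows from 'tensor memory' into 'registers'."""
    return [list(r) for r in tmem[row_start:row_start + rows]]

def apply_epilogue(fragment, fn):
    """Stage 2: apply an element-wise operation to the register fragment."""
    return [[fn(v) for v in row] for row in fragment]

def write_smem(fragment):
    """Stage 3: stage the fragment in a 'shared memory' tile."""
    return [list(r) for r in fragment]

def store_tile(gmem, smem_tile, coords):
    """Stage 4: copy the staged tile into 'global memory'.

    Assumption: coords is (n_coord, m_coord) = (column offset, row offset).
    """
    n_coord, m_coord = coords
    for i, row in enumerate(smem_tile):
        gmem[m_coord + i][n_coord:n_coord + len(row)] = row

def run_epilogue(tmem, gmem, coords, fn, rows):
    """Chain the four stages for a single output tile."""
    frag = read_tmem(tmem, 0, rows)
    frag = apply_epilogue(frag, fn)
    tile = write_smem(frag)
    store_tile(gmem, tile, coords)

# A 2x2 accumulator tile, a ReLU-style epilogue, and a 4x4 output buffer.
tmem = [[1.0, -2.0], [-3.0, 4.0]]
gmem = [[0.0] * 4 for _ in range(4)]
run_epilogue(tmem, gmem, (2, 1), lambda v: max(v, 0.0), rows=2)
```

On the GPU these stages are asynchronous and barrier-synchronized; the sketch shows only the order in which data moves and where the element-wise epilogue sits in that order.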
comptime values
RLayout32Bits
comptime RLayout32Bits[layout: Layout] = RuntimeLayout[layout, element_type=DType.uint32, linear_idx_type=DType.uint32]
Parameters
- layout (Layout):
ThreadwiseStoreWriter
comptime ThreadwiseStoreWriter = TileWriterThreadwise[?, ?, ?]
TMAStoreWriter
comptime TMAStoreWriter = TileWriterTMA
Structs
- AccumBarrier: Helper for accumulator pipeline barrier operations.
- AccumTile: Accumulator tile holding upper and lower fragment data.
- EpilogueApplier: Apply element-wise epilogue operations on register fragments.
- EpilogueConfig: Configuration for epilogue stage computations.
- FragmentCoords: Compute coordinates for fragment elements in tensor memory layout.
- OutputStageWriter: Orchestrate writing a single output stage.
- SMemEpilogueWriter: Write accumulator tile to SMEM and apply element-wise epilogue lambda.
- StMatrixConfig: Configuration for st.matrix store operations.
- StMatrixCoords: Compute coordinates for st.matrix operations.
- StMatrixWriter: Write register fragments to shared memory using st.matrix.
- TMAStoreCoords: Compute TMA store coordinates and warp election for SM100 epilogue.
- TMAStoreExecutor: Execute TMA store from shared memory to global memory with proper tiling.
- TMEMFragment: Accumulator fragment pair from tensor memory.
- TMEMReader: Load accumulator fragments from tensor memory (TMEM).
- TMEMToSMemWriter: Write TMEM accumulator fragments to shared memory for SM100.
Functions
- load_tmem_fragments: Load upper and lower fragments from TMEM and cast to epilogue type.
- shared_memory_epilogue: Apply element-wise epilogue to non-transposed shared memory tile.
- shared_memory_epilogue_transpose: Apply element-wise epilogue to transposed shared memory tile.
- store_fragment_to_smem: Store a fragment to shared memory using st.matrix.
- tma_store_with_pipeline: Perform TMA store with pipelined commit and wait.
- tma_wait_pipelined: Wait for TMA stores with pipelining.