Mojo module
tile_writer
TileWriter components for SM100 matrix multiplication epilogue.
This module provides modular components for the output pipeline:
- store_fragment_to_smem: Register to shared memory via st.matrix instructions
- TMEMToSMemWriter: Write TMEM accumulators to shared memory
- TMAStoreExecutor: Execute TMA stores with proper SMEM tiling
- EpilogueApplier: Apply element-wise operations on fragments
The SM100 epilogue pipeline flows as: TMEM (accumulators) → Registers → SMEM → GMEM (via TMA)
comptime values
RLayout32Bits
comptime RLayout32Bits[layout: Layout] = RuntimeLayout[layout, element_type=DType.uint32, linear_idx_type=DType.uint32]
Parameters
- layout (
Layout):
Structs
-
AccumBarrier: Pipeline barrier helper for single-CTA vs 2-CTA arrival patterns. -
AccumTile: Upper + lower TMEM fragments (16 rows each) for SM100 output. -
EpilogueApplier: Apply element-wise epilogue lambda to register fragments. -
EpilogueConfig: Computed epilogue parameters based on MMA and CTA configuration. -
FragmentCoords: Fragment element coordinates for tcgen05 16x256b matrix layout. -
SMemEpilogueWriter: SMEM-based epilogue: write accumulators and apply lambda in SMEM. -
TMAStoreCoords: TMA store coordinates and warp election for SM100 epilogue. -
TMAStoreExecutor: Execute TMA store from SMEM to GMEM with proper tiling. -
TMEMToSMemWriter: Write TMEM accumulators to SMEM via st.matrix (SM100-specific).
Functions
-
shared_memory_epilogue: Apply element-wise epilogue to non-transposed SMEM tile. -
shared_memory_epilogue_transpose: Apply element-wise epilogue to transposed SMEM tile. -
store_fragment_to_smem: Store fragment to SMEM via st.matrix instruction. -
tma_wait_pipelined: Wait for TMA stores with pipelining.
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!