Mojo module
matmul_output
SM100 Matmul Output Pipeline - TMEM → SMEM → GMEM epilogue.
This module contains the output pipeline code for SM100 matmul:
- copy_accum_to_gmem: Core epilogue pipeline (TMEM → Registers → SMEM → GMEM)
- multi_stage_store_C: Output pipeline orchestration for standard matmul
- multi_stage_store_C_split_k: Output pipeline for split-K matmul
The output pipeline handles:
- Loading accumulated results from Tensor Memory (TMEM)
- Applying optional epilogue operations (bias, activation)
- Writing to shared memory via stmatrix instructions
- Transferring to global memory via TMA async stores
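The four steps above can be pictured with a small host-side model. This is only a sketch in plain Python, not Mojo GPU code: lists stand in for TMEM, registers, SMEM, and GMEM, and every name in it (`epilogue_model`, `stage_cols`, `relu`) is hypothetical rather than part of this module. It also assumes a per-column bias and a ReLU activation purely for illustration; the module does not state which epilogue operations it supports beyond "bias, activation".

```python
# Host-side model only: Python lists stand in for TMEM, registers, SMEM and GMEM.
# All names are hypothetical and are NOT part of the matmul_output API.

def relu(x):
    """Assumed activation for the illustration."""
    return x if x > 0.0 else 0.0


def epilogue_model(tmem_accum, bias, stage_cols, activation=relu):
    """Move an accumulator tile to 'global memory' one column stage at a time."""
    rows = len(tmem_accum)
    cols = len(tmem_accum[0])
    gmem = [[0.0] * cols for _ in range(rows)]
    for start in range(0, cols, stage_cols):
        stop = min(start + stage_cols, cols)
        # 1. TMEM -> registers: load this stage's slice of the accumulator.
        regs = [row[start:stop] for row in tmem_accum]
        # 2. Optional epilogue: add a per-column bias (assumed), then activate.
        regs = [
            [activation(v + bias[start + j]) for j, v in enumerate(r)]
            for r in regs
        ]
        # 3. Registers -> SMEM: stage the tile in a shared buffer.
        smem = [list(r) for r in regs]
        # 4. SMEM -> GMEM: copy the staged tile to its place in the output.
        for i in range(rows):
            gmem[i][start:stop] = smem[i]
    return gmem


accum = [[1.0, -2.0, 3.0, -4.0],
         [0.5, 0.5, -0.5, -0.5]]
bias = [0.0, 1.0, 0.0, 1.0]
out = epilogue_model(accum, bias, stage_cols=2)
print(out)
```

Processing the tile in column stages mirrors the "multi-stage" naming of the store functions: each stage only needs a shared buffer large enough for one slice, not the whole tile. On real hardware the stages would overlap asynchronously, which this sequential model does not attempt to show.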
Functions
- accum_arrive: Signal accumulator arrival. Delegates to AccumBarrier.
- copy_accum_to_gmem: Epilogue pipeline: TMEM → Registers → SMEM → GMEM (via TMA).
- multi_stage_store_C: Orchestrate output from TMEM to GMEM via shared memory.
- multi_stage_store_C_split_k: Split-K output pipeline with reduction.
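To illustrate why a split-K output path needs a reduction, the sketch below partitions the K dimension into slices, computes one partial product per slice, and sums the partials. It is a plain Python model under stated assumptions: the function names (`matmul`, `split_k_matmul`) and the `num_splits` parameter are hypothetical, and the page does not say how `multi_stage_store_C_split_k` actually performs its reduction.

```python
# Conceptual model of split-K: hypothetical names, not the module's API.

def matmul(a, b):
    """Reference product of an (M x K) and a (K x N) matrix."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def split_k_matmul(a, b, num_splits):
    """Compute A @ B by summing partial products over slices of K."""
    rows, inner, cols = len(a), len(b), len(b[0])
    chunk = (inner + num_splits - 1) // num_splits  # ceil(K / num_splits)
    out = [[0] * cols for _ in range(rows)]
    for s in range(num_splits):
        lo = s * chunk
        hi = min(lo + chunk, inner)
        # Each split produces a partial accumulator over its own K slice.
        partial = [
            [sum(a[i][k] * b[k][j] for k in range(lo, hi)) for j in range(cols)]
            for i in range(rows)
        ]
        # Reduction step: add this split's partial result into the output.
        for i in range(rows):
            for j in range(cols):
                out[i][j] += partial[i][j]
    return out


a = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
b = [[1, 0],
     [0, 1],
     [1, 0],
     [0, 1]]
result = split_k_matmul(a, b, num_splits=2)
print(result)
```

Because each split sees only part of K, no single split holds the final value of an output element; the partial results have to be combined before or while they are written out, which is the role the reduction plays in a split-K pipeline.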