Mojo function
shared_memory_epilogue
shared_memory_epilogue[
    MMA_M: Int,
    data_paths: Int,
    num_stages: Int,
    stage: Int,
    stageN: Int,
    c_type: DType,
    shared_n: Int,
    simd_size: Int,
    c_smem_upper_layout: Layout,
    c_smem_lower_layout: Layout,
    swizzle: Swizzle,
    compute_lambda_fn: def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> SIMD[dtype, width],
    num_output_warps: Int,
](
    M: UInt32,
    N: UInt32,
    c_col: Int,
    c_row: Int,
    c_smem_warp_tile_upper: TileTensor[c_type, c_smem_warp_tile_upper.LayoutType, c_smem_warp_tile_upper.origin, address_space=AddressSpace.SHARED, linear_idx_type=c_smem_warp_tile_upper.linear_idx_type, element_size=c_smem_warp_tile_upper.element_size],
    c_smem_warp_tile_lower: TileTensor[c_type, c_smem_warp_tile_lower.LayoutType, c_smem_warp_tile_lower.origin, address_space=AddressSpace.SHARED, linear_idx_type=c_smem_warp_tile_lower.linear_idx_type, element_size=c_smem_warp_tile_lower.element_size],
)
Apply an element-wise epilogue to a non-transposed shared-memory (SMEM) tile.

Each warp processes an upper fragment (rows 0-15) and a lower fragment (rows 16-31). A distribute layout maps SIMD vectors to threads within each warp.
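To make the fragment split concrete, here is a minimal Python sketch of how a 32-lane warp could cover a 32-row tile: lanes are distributed row-major over the upper fragment (rows 0-15) and then the lower fragment (rows 16-31), each lane applying the epilogue to one SIMD vector of contiguous elements. The tile shape, SIMD width, and the `apply_epilogue` helper are illustrative assumptions, not the library's implementation; `compute_lambda_fn` is stood in for by a simple element-wise scale.

```python
# Illustrative sketch only -- not the actual kernel. Assumed sizes chosen so
# that 32 lanes exactly tile one 16-row fragment.
SIMD_SIZE = 2   # elements per SIMD vector (assumed)
SHARED_N = 4    # tile width in elements (assumed)

def lane_coords(lane: int, lower: bool) -> tuple[int, int]:
    """Map a warp lane to the (row, col) origin of its SIMD vector.

    Lanes are laid out row-major over 16 rows x (SHARED_N // SIMD_SIZE)
    vectors per row; the lower fragment is offset by 16 rows.
    """
    vecs_per_row = SHARED_N // SIMD_SIZE
    row = lane // vecs_per_row + (16 if lower else 0)
    col = (lane % vecs_per_row) * SIMD_SIZE
    return row, col

def epilogue(idx: tuple[int, int], vec: list[int]) -> list[int]:
    # Stand-in for compute_lambda_fn: element-wise scale by 2.
    return [2 * v for v in vec]

def apply_epilogue(tile: list[list[int]]) -> list[list[int]]:
    # One pass over the upper fragment, one over the lower, 32 lanes each.
    for lower in (False, True):
        for lane in range(32):
            row, col = lane_coords(lane, lower)
            vec = tile[row][col:col + SIMD_SIZE]
            tile[row][col:col + SIMD_SIZE] = epilogue((row, col), vec)
    return tile

tile = [[1] * SHARED_N for _ in range(32)]
apply_epilogue(tile)  # every element visited exactly once
```

The two-fragment split mirrors the `c_smem_warp_tile_upper` / `c_smem_warp_tile_lower` arguments above: each warp owns both halves of its 32-row tile rather than splitting rows across warps.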