Mojo function

shared_memory_epilogue_transpose

shared_memory_epilogue_transpose[
    stage: UInt,
    stageN: UInt,
    c_type: DType,
    c_smem_layout: Layout,
    swizzle: Swizzle,
    compute_lambda_fn: elementwise_compute_lambda_type,
    num_output_warps: UInt,
    warp_dim: UInt,
    MMA_M: Int,
    BN: Int,
    cta_group: Int
](
    M: UInt32,
    N: UInt32,
    c_col: UInt,
    c_row: UInt,
    c_smem: LayoutTensor[c_type, c_smem_layout, MutAnyOrigin, address_space=AddressSpace.SHARED, alignment=128],
    warp_i: UInt,
    warp_j: UInt
)

Apply the element-wise epilogue compute_lambda_fn to a transposed tile held in shared memory (c_smem).

Supports warp_dim=1 (stageN, warp_i, U) or warp_dim=2 (warp_j, stageN, warp_i, UL).