Mojo function

store_output_tile_via_tma

store_output_tile_via_tma[c_type: DType, c_tma_layout: Layout, c_tile_layout: Layout, //, BM: Int, BN: Int, WG_BM: Int, WG_BN: Int, TMA_BN: Int](c_tma_op: TMATensorTile[c_type, c_tma_layout, desc_layout], c_tile: LayoutTensor[c_type, c_tile_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128], local_thread_idx: UInt, block_x: Int, block_y: Int, sub_wg_bn_id: Int)

Store an output tile to global memory using the Tensor Memory Accelerator (TMA).

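Uses NVIDIA's TMA hardware to copy the tile asynchronously from shared memory (`c_tile`) to global memory. The hardware generates the global addresses from the TMA descriptor (`c_tma_op`), so threads do not compute per-element addresses themselves.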
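To make "automatic address generation" concrete, the following is a minimal conceptual model in plain Python, not the Mojo implementation and not the TMA API. It assumes a row-major global buffer and a 2D tile, and it assumes that elements falling outside the global tensor are skipped on a store. The function name `tma_store_tile_model` and all of its parameters are hypothetical and chosen only for illustration. On real hardware this copy is issued once and runs asynchronously rather than as a loop.

```python
def tma_store_tile_model(global_mem, rows, cols, tile, tile_rows, tile_cols, row0, col0):
    """Hypothetical model of a 2D tile store into a row-major global buffer.

    (row0, col0) is the tile's top-left coordinate in the global tensor.
    Elements that fall outside the rows x cols tensor are skipped
    (assumed boundary behaviour for stores).
    """
    for r in range(tile_rows):
        for c in range(tile_cols):
            gr, gc = row0 + r, col0 + c
            # Address generation: derived from tensor shape and tile coordinate.
            if 0 <= gr < rows and 0 <= gc < cols:
                global_mem[gr * cols + gc] = tile[r * tile_cols + c]
    return global_mem


# 4x6 global matrix, 2x4 tile placed at (row 3, col 4): only two elements fit.
out = [0] * 24
tile = list(range(1, 9))
tma_store_tile_model(out, 4, 6, tile, 2, 4, 3, 4)
```

In this model the caller supplies only the tile coordinate. The per-element offsets follow from the tensor shape, which is the role the descriptor plays for the hardware.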