Mojo function
store_output_tile_via_tma
store_output_tile_via_tma[c_type: DType, c_tma_layout: Layout, c_tile_layout: Layout, //, BM: Int, BN: Int, WG_BM: Int, WG_BN: Int, TMA_BN: Int](c_tma_op: TMATensorTile[c_type, c_tma_layout, desc_layout], c_tile: LayoutTensor[c_type, c_tile_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128], local_thread_idx: UInt, block_x: Int, block_y: Int, sub_wg_bn_id: Int)
Stores an output tile to global memory using the Tensor Memory Accelerator (TMA).
Uses NVIDIA's TMA hardware to copy the tile asynchronously from shared memory (`c_tile`) to global memory through the TMA descriptor `c_tma_op`. The hardware generates the addresses automatically.