Mojo function

store_output_tile_via_tma

store_output_tile_via_tma[c_type: DType, c_tma_layout: Layout, c_tile_layout: Layout, //, BM: Int, BN: Int, WG_BM: Int, WG_BN: Int, TMA_BN: Int](c_tma_op: TMATensorTile[c_type, c_tma_layout, desc_layout], c_tile: LayoutTensor[c_type, c_tile_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128], local_thread_idx: UInt, block_x: Int, block_y: Int, sub_wg_bn_id: Int)

Stores an output tile from shared memory to global memory using the Tensor Memory Accelerator (TMA).

Uses NVIDIA's TMA hardware for efficient async memory transfers from shared memory to global memory with automatic address generation.
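As a rough illustration of what "automatic address generation" means, the sketch below models a 2D tile store in plain Python. It is a conceptual model only, not this function's implementation and not the Mojo or CUDA API: `TileDescriptor` and `tma_store_model` are hypothetical names, the tensor is assumed to be row-major, and the real transfer is asynchronous and performed by hardware rather than by a loop. The point is that the caller supplies only a tile origin, and every destination address follows from that origin and a descriptor of the global tensor, with elements outside the tensor bounds skipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileDescriptor:
    """Hypothetical stand-in for a TMA descriptor of a row-major 2D tensor."""
    rows: int      # global tensor height
    cols: int      # global tensor width
    box_rows: int  # tile height moved per store
    box_cols: int  # tile width moved per store


def tma_store_model(desc, global_mem, tile, row0, col0):
    """Copy `tile` into flat row-major `global_mem` at tile origin (row0, col0).

    Addresses are derived from the descriptor and the origin alone.
    Elements that would land outside the global tensor are skipped.
    """
    for r in range(desc.box_rows):
        for c in range(desc.box_cols):
            gr, gc = row0 + r, col0 + c
            if 0 <= gr < desc.rows and 0 <= gc < desc.cols:
                global_mem[gr * desc.cols + gc] = tile[r][c]


# 4x6 global matrix written with 2x4 tiles (assumed sizes for illustration).
desc = TileDescriptor(rows=4, cols=6, box_rows=2, box_cols=4)
global_mem = [0] * (desc.rows * desc.cols)
tile = [[1, 2, 3, 4], [5, 6, 7, 8]]

# First store fits entirely; the second overhangs the right edge by two columns.
tma_store_model(desc, global_mem, tile, row0=0, col0=0)
tma_store_model(desc, global_mem, tile, row0=2, col0=4)
```

In this model the second store writes only the two in-bounds columns of each tile row, which is why a caller can pass tile coordinates without computing per-element addresses or edge masks itself.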