Mojo function
cp_async_bulk_tensor_global_shared_cta
cp_async_bulk_tensor_global_shared_cta[src_type: AnyType, rank: Int, /, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: Index[rank])
Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
This function provides an efficient way to write data back from shared memory to global memory using TMA. It supports rank-1 and rank-2 tensors and allows control over the cache eviction policy.
Note:

- This operation is asynchronous - use appropriate memory barriers to ensure completion.
- Only rank-1 and rank-2 tensors are supported.
- Requires an NVIDIA GPU with TMA support.
- The source memory must be properly aligned for TMA operations.
- The TMA descriptor must be properly initialized before use.
Parameters:
- src_type (AnyType): The data type of the source tensor elements.
- rank (Int): The dimensionality of the tensor (must be 1 or 2).
- eviction_policy (CacheEviction): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to EVICT_NORMAL.
Args:
- src_mem (UnsafePointer[src_type, address_space=AddressSpace(3)]): Pointer to the source data in shared memory that will be copied to global memory. Must be properly aligned according to TMA requirements.
- tma_descriptor (UnsafePointer[NoneType]): Pointer to the TMA descriptor containing metadata about the tensor layout and memory access patterns.
- coords (Index[rank]): Coordinates of the tensor tile to copy. For rank-1 tensors, this is a single coordinate; for rank-2 tensors, it contains the row and column coordinates.