Mojo function

cp_async_bulk_tensor_reduce

cp_async_bulk_tensor_reduce[src_type: AnyType, rank: Int, /, *, reduction_kind: ReduceOp, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])

Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Accelerator (TMA) mechanism.

This function performs an in-place reduction: it combines a tile of data in shared memory with the corresponding tile in global memory using the specified reduction operation, and the result is stored in global memory. The operation is asynchronous and uses TMA's tile mode for efficient memory access.
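
The effect of the reduction on a single rank-2 tile can be modeled on the host. The Python sketch below is a reference model only, not the Mojo API: `reduce_tile`, its argument names, and the nested-list layout are hypothetical. It assumes the tile lies entirely inside the tensor and ignores asynchrony, alignment, and atomicity.

```python
def reduce_tile(global_mem, tile, row, col, op):
    """Combine `tile` into `global_mem` in place, starting at (row, col).

    `global_mem` stands in for the tensor in global memory and `tile` for
    the source data in shared memory (both hypothetical nested lists).
    `op(old, new)` plays the role of the reduction kind.
    """
    for i, tile_row in enumerate(tile):
        for j, value in enumerate(tile_row):
            old = global_mem[row + i][col + j]
            # The result overwrites the global-memory element; the tile is unchanged.
            global_mem[row + i][col + j] = op(old, value)


# A 3x4 "global" tensor and a 2x2 "shared" tile reduced with addition.
global_mem = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
tile = [[10, 20], [30, 40]]
reduce_tile(global_mem, tile, 1, 2, lambda old, new: old + new)
```

After the call, only the 2x2 region starting at row 1, column 2 of `global_mem` has changed, and `tile` is left as it was.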

Notes:

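This operation is asynchronous: the call only initiates the reduction, so use appropriate memory barriers to ensure completion before reusing the source buffer or relying on the result in global memory.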
Only supports rank-1 and rank-2 tensors.
Requires NVIDIA GPU with TMA support.
The source memory must be properly aligned for TMA operations.
The TMA descriptor must be properly initialized before use.
The reduction operation is performed atomically to ensure correctness.

Parameters:

src_type (AnyType): The data type of the source tensor elements.
rank (Int): The dimensionality of the tensor (must be 1 or 2).
reduction_kind (ReduceOp): The type of reduction operation to perform. Supported operations are: "add", "min", "max", "inc", "dec", "and", "or", "xor".
eviction_policy (CacheEviction): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to EVICT_NORMAL.
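
The reduction kinds listed above can be illustrated as element-wise binary functions. The Python sketch below is a host-side model, not part of the Mojo API: `REDUCE_OPS` and `apply_reduction` are hypothetical names, values are treated as unsigned integers, and "inc" and "dec" are assumed to follow the wrapping semantics of the PTX `atom.inc` and `atom.dec` instructions, with the shared-memory value acting as the bound.

```python
import operator


def _inc(old, bound):
    # Assumed PTX-style wrapping increment: reset to 0 once old reaches bound.
    return 0 if old >= bound else old + 1


def _dec(old, bound):
    # Assumed PTX-style wrapping decrement: reload bound at 0 or when out of range.
    return bound if (old == 0 or old > bound) else old - 1


# Maps each reduction-kind name to a function of
# (value in global memory, value from shared memory).
REDUCE_OPS = {
    "add": operator.add,
    "min": min,
    "max": max,
    "inc": _inc,
    "dec": _dec,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def apply_reduction(kind, old, new):
    """Return the value that would be stored back to global memory."""
    return REDUCE_OPS[kind](old, new)
```

For example, `apply_reduction("xor", 6, 3)` combines a global value of 6 with a shared value of 3 using bitwise XOR.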

Args:

src_mem (UnsafePointer): Pointer to the source data in shared memory that will be reduced with the global memory data. Must be properly aligned according to TMA requirements.
tma_descriptor (UnsafePointer): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns.
coords (IndexList): Coordinates specifying which tile of the tensor to operate on. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates.