Skip to main content

Mojo function

cp_async_bulk_tensor_2d_gather4

cp_async_bulk_tensor_2d_gather4[dst_type: AnyType, mbr_type: AnyType, /, *, cta_group: Int = 1, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL](dst_mem: UnsafePointer[dst_type, dst_mem.origin, address_space=AddressSpace.SHARED], tma_descriptor: UnsafePointer[NoneType, tma_descriptor.origin], mem_bar: UnsafePointer[mbr_type, mem_bar.origin, address_space=AddressSpace.SHARED], col_idx: Int32, row0: Int32, row1: Int32, row2: Int32, row3: Int32)

Initiates an asynchronous gather4 copy of 4 non-contiguous rows from a 2D tensor in global memory into shared memory using TMA.

This is a Blackwell (SM100+) instruction that loads 4 rows at arbitrary row indices from a 2D tensor map, placing them contiguously in shared memory. The tensor map must be created with box dim1=1 (one row per tile). Each row is a full tile along the column dimension.

PTX: cp.async.bulk.tensor.2d.shared::cta.global.tile::gather4 .mbarrier::complete_tx::bytes

Constraints:

Requires SM100 (Blackwell) or newer GPU architecture.

Parameters:

  • dst_type (AnyType): The data type of the destination memory.
  • mbr_type (AnyType): The data type of the memory barrier.
  • cta_group (Int): The CTA group for the copy operation. Must be 1 or 2.
  • eviction_policy (CacheEviction): Cache eviction policy. Defaults to EVICT_NORMAL.

Args:

  • dst_mem (UnsafePointer): Pointer to the destination in shared memory. Must be 128-byte aligned.
  • tma_descriptor (UnsafePointer): Pointer to the TMA descriptor (created with cuTensorMapEncodeTiled and box dim1=1).
  • mem_bar (UnsafePointer): Pointer to a shared memory barrier for synchronization.
  • col_idx (Int32): Column offset in the source tensor (typically 0 for full-row loads).
  • row0 (Int32): Row index of the first row to gather.
  • row1 (Int32): Row index of the second row to gather.
  • row2 (Int32): Row index of the third row to gather.
  • row3 (Int32): Row index of the fourth row to gather.

Was this page helpful?