Skip to main content

Mojo function

cp_async_bulk_tensor_shared_cluster_global_im2col

cp_async_bulk_tensor_shared_cluster_global_im2col[dst_type: AnyType, mbr_type: AnyType, tensor_rank: Int, /, *, cta_group: Int = 1](dst_mem: UnsafePointer[dst_type, dst_mem.origin, address_space=AddressSpace.SHARED], tma_descriptor: UnsafePointer[NoneType, tma_descriptor.origin], mem_bar: UnsafePointer[mbr_type, mem_bar.origin, address_space=AddressSpace.SHARED], coords: IndexList[tensor_rank], filter_offsets: IndexList[(tensor_rank - 2)])

Initiates an asynchronous TMA load with im2col addressing for convolution.

This function performs a TMA load using im2col mode, which applies coordinate transformation suitable for implicit GEMM convolution. The TMA descriptor must be created with cuTensorMapEncodeIm2col.

For 2D convolution with 4D NHWC tensor:

  • coords: (c, w, h, n) - channel, output spatial, batch
  • filter_offsets: (offset_w, offset_h) - position within filter window

PTX instruction formats differ based on cta_group:

  • cta_group=1: Uses SM90-style PTX (no cta_group modifier) cp.async.bulk.tensor.4d.shared::cluster.global.im2col...
  • cta_group=2: Uses SM100-style PTX with cta_group::2 (from CUTLASS) cp.async.bulk.tensor.4d.im2col.cta_group::2.shared::cluster.global...

Parameters:

  • dst_type (AnyType): The data type of the destination memory.
  • mbr_type (AnyType): The data type of the memory barrier.
  • tensor_rank (Int): The rank of the tensor (3, 4, or 5).
  • cta_group (Int): The CTA group to use for the copy operation. Must be 1 or 2.

Args:

  • dst_mem (UnsafePointer): Pointer to the destination in shared memory.
  • tma_descriptor (UnsafePointer): Pointer to the TMA im2col descriptor.
  • mem_bar (UnsafePointer): Pointer to the shared memory barrier.
  • coords (IndexList): Tensor coordinates (c, w, h, n for 4D).
  • filter_offsets (IndexList): Filter window offsets (offset_w, offset_h for 4D).

Was this page helpful?