Skip to main content

Mojo struct

GenericToSharedAsyncTileCopier

struct GenericToSharedAsyncTileCopier[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], *, swizzle: Optional[Swizzle] = None, masked: Bool = False, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL, num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope.BLOCK]

An AsyncTileCopier that asynchronously moves a tile from generic memory into shared memory using NVIDIA's cp.async instruction.

On NVIDIA GPUs (compute capability 8.0+), the copy issues cp.async instructions, allowing the transfer to overlap with subsequent compute. On AMD and Apple GPUs the underlying async_copy intrinsic falls back to synchronous loads and stores.

The copy is asynchronous on NVIDIA: callers must commit it via async_copy_commit_group() and synchronize via async_copy_wait_all() or async_copy_wait_group() before reading the destination tile.

The vector size in bytes (size_of[dtype]() * element_size) must be 4, 8, or 16.

Parameters

  • thread_layout (Layout): Layout describing how threads are organized over the copy.
  • swizzle (Optional): Optional swizzle applied to the shared-memory destination for bank-conflict mitigation. None produces a straight copy. Subsequent readers of the tile must use the same swizzle.
  • masked (Bool): When True, performs per-vector bounds-checking against src.dim[0]() * row_stride. Vectors that fall past the bound issue zero-byte cp.async operations with fill=0, which the hardware fulfills by zeroing the destination bytes. Intended for source tiles whose row count is dynamic (e.g. attention prefill loading the tail of a sequence).
  • eviction_policy (CacheEviction): Cache eviction policy for the source data.
  • num_threads (Int): Total number of threads in the thread block. Threads beyond thread_layout.size() do not participate.
  • thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to ThreadScope.BLOCK.

Implemented traits

AnyType, AsyncTileCopier, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable

comptime members

dst_address_space

comptime dst_address_space = AddressSpace.SHARED

Destination AddressSpace this copier writes to.

src_address_space

comptime src_address_space = AddressSpace.GENERIC

Source AddressSpace this copier reads from.

Methods

copy

copy[element_size: Int](self, dst: TileTensor[dst.dtype, dst.LayoutType, dst.origin, address_space=GenericToSharedAsyncTileCopier[thread_layout, swizzle=swizzle, masked=masked, eviction_policy=eviction_policy, num_threads=num_threads, thread_scope=thread_scope].dst_address_space, linear_idx_type=dst.linear_idx_type, element_size=element_size], src: TileTensor[src.dtype, src.LayoutType, src.origin, linear_idx_type=src.linear_idx_type, element_size=element_size])

Asynchronously copies src in generic memory into dst in shared memory.

The copy is issued via cp.async on NVIDIA. Callers must commit and wait on the copy before using the destination tile.

Parameters:

  • element_size (Int): Number of scalar elements per logical element.

Args:

  • dst (TileTensor): Destination tile in shared memory.
  • src (TileTensor): Source tile in generic memory.

Was this page helpful?