Mojo struct
GenericToSharedAsyncTileCopier
struct GenericToSharedAsyncTileCopier[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], *, swizzle: Optional[Swizzle] = None, masked: Bool = False, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL, num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope.BLOCK]
An AsyncTileCopier that asynchronously moves a tile from generic memory into shared memory using NVIDIA's cp.async instruction.
On NVIDIA GPUs (compute capability 8.0+), the copy issues cp.async
instructions, allowing the transfer to overlap with subsequent compute.
On AMD and Apple GPUs the underlying async_copy intrinsic falls back
to synchronous loads and stores.
The copy is asynchronous on NVIDIA: callers must commit it via
async_copy_commit_group() and synchronize via async_copy_wait_all()
or async_copy_wait_group() before reading the destination tile.
The vector size in bytes (size_of[dtype]() * element_size) must be
4, 8, or 16.
Parameters
- thread_layout (
Layout): Layout describing how threads are organized over the copy. - swizzle (
Optional): Optional swizzle applied to the shared-memory destination for bank-conflict mitigation.Noneproduces a straight copy. Subsequent readers of the tile must use the same swizzle. - masked (
Bool): WhenTrue, performs per-vector bounds-checking againstsrc.dim[0]() * row_stride. Vectors that fall past the bound issue zero-bytecp.asyncoperations withfill=0, which the hardware fulfills by zeroing the destination bytes. Intended for source tiles whose row count is dynamic (e.g. attention prefill loading the tail of a sequence). - eviction_policy (
CacheEviction): Cache eviction policy for the source data. - num_threads (
Int): Total number of threads in the thread block. Threads beyondthread_layout.size()do not participate. - thread_scope (
ThreadScope): Scope at which thread operations are performed (BLOCKorWARP). Defaults toThreadScope.BLOCK.
Implemented traits
AnyType,
AsyncTileCopier,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable
comptime members
dst_address_space
comptime dst_address_space = AddressSpace.SHARED
Destination AddressSpace this copier writes to.
src_address_space
comptime src_address_space = AddressSpace.GENERIC
Source AddressSpace this copier reads from.
Methods
copy
copy[element_size: Int](self, dst: TileTensor[dst.dtype, dst.LayoutType, dst.origin, address_space=GenericToSharedAsyncTileCopier[thread_layout, swizzle=swizzle, masked=masked, eviction_policy=eviction_policy, num_threads=num_threads, thread_scope=thread_scope].dst_address_space, linear_idx_type=dst.linear_idx_type, element_size=element_size], src: TileTensor[src.dtype, src.LayoutType, src.origin, linear_idx_type=src.linear_idx_type, element_size=element_size])
Asynchronously copies src in generic memory into dst in shared memory.
The copy is issued via cp.async on NVIDIA. Callers must commit
and wait on the copy before using the destination tile.
Parameters:
- element_size (
Int): Number of scalar elements per logical element.
Args:
- dst (
TileTensor): Destination tile in shared memory. - src (
TileTensor): Source tile in generic memory.
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!