Skip to main content

Mojo struct

TileLoaderCPAsync

@register_passable(trivial) struct TileLoaderCPAsync[dtype: DType, src_layout: Layout, thread_layout: Layout, swizzle_mode: TensorMapSwizzle, vector_size: Int]

Software-based tile loader using cp.async instructions.

This loader uses CUDA's cp.async instructions for asynchronous memory transfers with manual bounds checking and shared memory swizzling for optimal bank conflict avoidance.

Parameters

  • dtype (DType): Data type of the elements being loaded.
  • src_layout (Layout): Layout of the source matrix in global memory.
  • thread_layout (Layout): Thread arrangement for distributed copying.
  • swizzle_mode (TensorMapSwizzle): Swizzling pattern for shared memory access.
  • vector_size (Int): Number of elements loaded per thread.

Fields

  • src (LayoutTensor[dtype, src_layout, MutAnyOrigin]):

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, Movable, TileLoader, UnknownDestructibility

Aliases

__copyinit__is_trivial

alias __copyinit__is_trivial = True

__del__is_trivial

alias __del__is_trivial = True

__moveinit__is_trivial

alias __moveinit__is_trivial = True

Methods

__init__

__init__(src: LayoutTensor[dtype, src_layout, MutAnyOrigin]) -> Self

Initialize the cp.async tile loader.

Args:

load_tile

load_tile(self, dst: LayoutTensor[TileLoaderCPAsync[dtype, src_layout, thread_layout, swizzle_mode, vector_size]._dtype, layout, MutAnyOrigin, address_space=AddressSpace.SHARED, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=128], mem_barrier: LegacyUnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED], coords: Tuple[UInt, UInt])

Load a tile using cp.async instructions.

Extracts a tile from the source tensor and performs an asynchronous copy to shared memory with bounds checking and swizzling.

Note: Unlike TMA, this method expects tile indices and handles the conversion to element offsets internally via the tile() method.

Args:

  • dst (LayoutTensor): Destination tile in shared memory.
  • mem_barrier (LegacyUnsafePointer): Memory barrier for synchronization (currently unused).
  • coords (Tuple): Tile indices (row_tile, col_tile) in the source matrix.

Was this page helpful?