Mojo struct

TileLoaderCPAsync

struct TileLoaderCPAsync[dtype: DType, src_layout: Layout, thread_layout: Layout, swizzle_mode: TensorMapSwizzle, vector_size: Int]

Software-based tile loader using cp.async instructions.

This loader issues cp.async instructions to copy data asynchronously from global memory to shared memory. It performs manual bounds checking and swizzles the shared memory layout to reduce bank conflicts.
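To illustrate why swizzling reduces bank conflicts, here is a minimal sketch of a generic XOR swizzle. It is not the loader's actual pattern, which is selected by swizzle_mode. The helper name swizzled_chunk and the assumption of 8 chunks per shared memory row are hypothetical and exist only for this example.

```mojo
# Hypothetical helper for illustration; not part of TileLoaderCPAsync.
# Assumes each shared memory row is split into 8 equally sized chunks.
fn swizzled_chunk(row: Int, chunk: Int) -> Int:
    # XOR the chunk index with the row index (mod 8) so that the same
    # logical chunk lands in a different physical chunk on each row.
    return chunk ^ (row % 8)


fn main():
    # Logical chunk 0 of rows 0..7 maps to 8 distinct physical chunks,
    # so a column-wise access no longer hits the same chunk every time.
    for row in range(8):
        print(row, swizzled_chunk(row, 0))
```

Without the XOR step, logical chunk 0 would sit in the same physical position on every row, which is the access pattern that tends to cause bank conflicts.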

Parameters​

  • ​dtype (DType): Data type of the elements being loaded.
  • ​src_layout (Layout): Layout of the source matrix in global memory.
  • ​thread_layout (Layout): Thread arrangement for distributed copying.
  • ​swizzle_mode (TensorMapSwizzle): Swizzling pattern for shared memory access.
  • ​vector_size (Int): Number of elements loaded per thread.

Fields​

  • src (LayoutTensor[dtype, src_layout, ImmutAnyOrigin]): Source tensor in global memory from which tiles are loaded.

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TileLoader, TrivialRegisterPassable

Methods​

__init__​

__init__(src: LayoutTensor[dtype, src_layout, ImmutAnyOrigin]) -> Self

Initialize the cp.async tile loader.

Args:

  • src (LayoutTensor[dtype, src_layout, ImmutAnyOrigin]): Source tensor in global memory.

load_tile​

load_tile(self, dst: LayoutTensor[dtype, MutAnyOrigin, address_space=AddressSpace.SHARED, element_layout=dst.element_layout, layout_int_type=dst.layout_int_type, linear_idx_type=dst.linear_idx_type, masked=dst.masked, alignment=128], mem_barrier: UnsafePointer[SharedMemBarrier, MutAnyOrigin, address_space=AddressSpace.SHARED], coords: Tuple[Int, Int])

Load a tile using cp.async instructions.

Extracts a tile from the source tensor and performs an asynchronous copy to shared memory with bounds checking and swizzling.

Note: Unlike TMA, this method expects coords to hold tile indices rather than element offsets. It converts them to element offsets internally via the tile() method.
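The conversion from tile indices to element offsets, together with the bounds check, can be sketched as follows. This is an illustration under stated assumptions, not the loader's implementation: the helpers tile_origin and valid_extent are hypothetical, and the matrix and tile sizes are made up for the example.

```mojo
# Hypothetical helpers for illustration; not part of TileLoaderCPAsync.

fn tile_origin(tile_idx: Int, tile_dim: Int) -> Int:
    # Element offset of the first row (or column) of the given tile.
    return tile_idx * tile_dim


fn valid_extent(origin: Int, tile_dim: Int, matrix_dim: Int) -> Int:
    # Number of in-bounds elements along one dimension, clipped to
    # the range [0, tile_dim].
    return max(0, min(tile_dim, matrix_dim - origin))


fn main():
    # Assumed sizes: a 100 x 200 source matrix split into 64 x 32 tiles.
    var row0 = tile_origin(1, 64)  # tile row 1 starts at element row 64
    var col0 = tile_origin(3, 32)  # tile column 3 starts at element column 96
    print(row0, col0)  # 64 96
    print(valid_extent(row0, 64, 100))  # 36: only rows 64..99 exist
    print(valid_extent(col0, 32, 200))  # 32: the tile fits fully
```

A caller passing coords = (1, 3) therefore refers to the tile whose origin is element (64, 96). In this sketch, the last tile row is partial, which is the case the bounds check has to handle.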

Args: