Mojo struct
TileLoaderCPAsync
struct TileLoaderCPAsync[dtype: DType, src_layout: Layout, thread_layout: Layout, swizzle_mode: TensorMapSwizzle, vector_size: Int]
Software-based tile loader using cp.async instructions.
This loader uses CUDA's cp.async instructions for asynchronous memory transfers with manual bounds checking and shared memory swizzling for optimal bank conflict avoidance.
Parametersβ
- βdtype (
DType): Data type of the elements being loaded. - βsrc_layout (
Layout): Layout of the source matrix in global memory. - βthread_layout (
Layout): Thread arrangement for distributed copying. - βswizzle_mode (
TensorMapSwizzle): Swizzling pattern for shared memory access. - βvector_size (
Int): Number of elements loaded per thread.
Fieldsβ
- βsrc (
LayoutTensor[dtype, src_layout, ImmutAnyOrigin]):
Implemented traitsβ
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TileLoader,
TrivialRegisterPassable
Methodsβ
__init__β
__init__(src: LayoutTensor[dtype, src_layout, ImmutAnyOrigin]) -> Self
Initialize the cp.async tile loader.
Args:
- βsrc (
LayoutTensor[dtype, src_layout, ImmutAnyOrigin]): Source tensor in global memory.
load_tileβ
load_tile(self, dst: LayoutTensor[dtype, MutAnyOrigin, address_space=AddressSpace.SHARED, element_layout=dst.element_layout, layout_int_type=dst.layout_int_type, linear_idx_type=dst.linear_idx_type, masked=dst.masked, alignment=128], mem_barrier: UnsafePointer[SharedMemBarrier, MutAnyOrigin, address_space=AddressSpace.SHARED], coords: Tuple[Int, Int])
Load a tile using cp.async instructions.
Extracts a tile from the source tensor and performs an asynchronous copy to shared memory with bounds checking and swizzling.
Note: Unlike TMA, this method expects tile indices and handles the conversion to element offsets internally via the tile() method.
Args:
- βdst (
LayoutTensor[dtype, MutAnyOrigin, address_space=AddressSpace.SHARED, element_layout=dst.element_layout, layout_int_type=dst.layout_int_type, linear_idx_type=dst.linear_idx_type, masked=dst.masked, alignment=128]): Destination tile in shared memory. - βmem_barrier (
UnsafePointer[SharedMemBarrier, MutAnyOrigin, address_space=AddressSpace.SHARED]): Memory barrier for synchronization (currently unused). - βcoords (
Tuple[Int, Int]): Tile indices (row_tile, col_tile) in the source matrix.
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!