
Mojo struct

TileLoaderCPAsync

struct TileLoaderCPAsync[dtype: DType, src_layout: TensorLayout, thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], swizzle_mode: TensorMapSwizzle, vector_size: Int]

Software-based tile loader using cp.async instructions.

This loader issues cp.async instructions to copy data asynchronously from global memory to shared memory. It checks bounds manually and swizzles the shared memory layout to reduce bank conflicts.
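To make the bank-conflict point concrete, the following Python sketch models one common XOR swizzle over 128-byte rows split into 16-byte chunks. It is only an illustration: the constants, the function names, and the XOR pattern itself are assumptions made for this example, and the pattern that a given swizzle_mode actually selects may differ.

```python
# Conceptual model of an XOR swizzle; not the Mojo API.
CHUNK_BYTES = 16                            # one vectorized copy per chunk (assumed)
ROW_BYTES = 128                             # bytes per shared-memory row (assumed)
CHUNKS_PER_ROW = ROW_BYTES // CHUNK_BYTES   # 8 chunks per row


def swizzled_chunk(row: int, chunk: int) -> int:
    """XOR the chunk index with the low bits of the row index."""
    return chunk ^ (row % CHUNKS_PER_ROW)


def bank_group(row: int, chunk: int, swizzle: bool) -> int:
    """Return which group of banks a 16-byte chunk falls into.

    The 32 four-byte banks span 128 bytes, so they form 8 groups of 16 bytes.
    """
    c = swizzled_chunk(row, chunk) if swizzle else chunk
    byte_offset = row * ROW_BYTES + c * CHUNK_BYTES
    return (byte_offset // CHUNK_BYTES) % CHUNKS_PER_ROW


# Eight threads reading the same chunk column from eight consecutive rows.
plain = {bank_group(r, 3, swizzle=False) for r in range(8)}
swizzled = {bank_group(r, 3, swizzle=True) for r in range(8)}
```

In this model the unswizzled column access lands in a single bank group, so the eight reads serialize, while the swizzled access spreads across all eight groups.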

Parameters

dtype (DType): Data type of the elements being loaded.
src_layout (TensorLayout): Layout of the source matrix in global memory.
thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Thread arrangement for distributed copying.
swizzle_mode (TensorMapSwizzle): Swizzling pattern for shared memory access.
vector_size (Int): Number of elements loaded per thread.
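
As a rough picture of how thread_layout and vector_size interact, the Python sketch below assigns the vectors of one tile to threads. It assumes a row-major thread arrangement and a round-robin assignment; the helper name and the mapping are hypothetical and may not match how the loader actually distributes work.

```python
# Conceptual model of distributed copying; not the Mojo API.
def thread_assignments(tile_shape, thread_shape, vector_size):
    """Map each thread id to the (row, col) starts of the vectors it copies.

    Assumes the tile width is a multiple of vector_size and that thread ids
    are laid out row-major over thread_shape.
    """
    tile_rows, tile_cols = tile_shape
    thr_rows, thr_cols = thread_shape
    vecs_per_row = tile_cols // vector_size
    out = {}
    for r in range(tile_rows):
        for v in range(vecs_per_row):
            tid = (r % thr_rows) * thr_cols + (v % thr_cols)
            out.setdefault(tid, []).append((r, v * vector_size))
    return out


# An 8x32 tile copied by a 4x4 thread arrangement, 4 elements per copy.
work = thread_assignments((8, 32), (4, 4), 4)
```

With these numbers the tile holds 64 vectors, so each of the 16 threads issues four copies of vector_size elements.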

Fields

src (TileTensor[dtype, src_layout, ImmutAnyOrigin]): Source tensor in global memory.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TileLoader, TrivialRegisterPassable

Methods

`__init__`

def __init__(src: TileTensor[dtype, src_layout, ImmutAnyOrigin]) -> Self

Initialize the cp.async tile loader.

Args:

src (TileTensor[dtype, src_layout, ImmutAnyOrigin]): Source tensor in global memory.

`load_tile`

def load_tile(self, dst: TileTensor[address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], mem_barrier: UnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED], coords: Tuple[Int, Int])

Load a tile using cp.async instructions.

Extracts a tile from the source tensor and performs an asynchronous copy to shared memory with bounds checking and swizzling.

Note: Unlike TMA, this method expects tile indices and handles the conversion to element offsets internally via the tile() method.

Args:

dst (TileTensor[address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size]): Destination tile in shared memory.
mem_barrier (UnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED]): Memory barrier for synchronization (currently unused).
coords (Tuple[Int, Int]): Tile indices (row_tile, col_tile) in the source matrix.
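
The conversion from tile indices to element offsets, and the bounds check near the matrix edge, can be pictured with the Python sketch below. The function name, the explicit tile_shape and matrix_shape arguments, and the clamping logic are assumptions for illustration; in the loader itself these values come from the layouts of the source and destination tensors.

```python
# Conceptual model of tile indexing with bounds checking; not the Mojo API.
def tile_bounds(coords, tile_shape, matrix_shape):
    """Return the element origin of a tile and how much of it is in bounds."""
    row_tile, col_tile = coords
    tile_rows, tile_cols = tile_shape
    rows, cols = matrix_shape

    # Tile indices scale by the tile shape to give element offsets.
    row0 = row_tile * tile_rows
    col0 = col_tile * tile_cols

    # Clamp the extent so copies never read past the matrix edge.
    valid_rows = max(0, min(tile_rows, rows - row0))
    valid_cols = max(0, min(tile_cols, cols - col0))
    return (row0, col0), (valid_rows, valid_cols)


# Tile (1, 2) of a 100x80 matrix split into 64x32 tiles.
origin, extent = tile_bounds((1, 2), (64, 32), (100, 80))
```

Here the tile starts at element (64, 64) but only a 36x16 region lies inside the matrix, which is the kind of partial tile that manual bounds checking has to handle.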
