Mojo struct

TileLoaderTMA

@register_passable(trivial) struct TileLoaderTMA[_mlir_origin: LITImmutOrigin, //, tma_origin: ImmutOrigin, dtype: DType, tile_layout: Layout, desc_layout: Layout, /, *, BK: Scalar[DType.uint], cluster_size: Int32, use_partitioned_multicast: Bool]

TMA-based tile loader for hardware-accelerated memory transfers.

This loader uses NVIDIA's Tensor Memory Accelerator (TMA) for efficient 2D tile transfers from global to shared memory, with optional multicast support for multi-block clusters.

Parameters

tma_origin (ImmutOrigin): Origin type for the TMA operation.
dtype (DType): Data type of the elements being loaded.
tile_layout (Layout): Layout of the complete tile in shared memory.
desc_layout (Layout): Layout described by the TMA descriptor (may be smaller).
BK (Scalar): Block size in the K dimension (for coordinate conversion).
cluster_size (Int32): Number of blocks in the cluster (1 for no clustering).
use_partitioned_multicast (Bool): Whether to use partitioned multicast loading.
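
The relationship between `tile_layout` and `desc_layout` can be illustrated with a small model. The sketch below is plain Python, not Mojo kernel code, and it rests on an assumption the reference does not state: that when the descriptor layout is smaller than the tile, the tile is covered by several descriptor-sized boxes whose dimensions evenly divide the tile's. The function name `descriptor_boxes_per_tile` is hypothetical.

```python
def descriptor_boxes_per_tile(tile_shape, desc_shape):
    """Count the descriptor-sized boxes needed to cover one tile.

    Assumption (not from the reference): each descriptor dimension
    evenly divides the matching tile dimension.
    """
    if len(tile_shape) != len(desc_shape):
        raise ValueError("tile and descriptor must have the same rank")
    count = 1
    for tile_dim, desc_dim in zip(tile_shape, desc_shape):
        if tile_dim % desc_dim != 0:
            raise ValueError("descriptor dims must evenly divide tile dims")
        count *= tile_dim // desc_dim
    return count


# A 128x64 tile described by a 64x64 descriptor needs two boxes.
boxes = descriptor_boxes_per_tile((128, 64), (64, 64))
print(boxes)
```

When the two shapes are equal the count is 1, which corresponds to a single transfer per tile under this model.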

Fields

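tma_op (TileLoaderTMA[tma_origin, dtype, tile_layout, desc_layout, BK=BK, cluster_size=cluster_size, use_partitioned_multicast=use_partitioned_multicast].TMATensorTilePtr): Pointer to the TMA tensor descriptor.
rank (UInt): Rank of this block within the cluster.
multicast_mask (UInt16): Bit mask for multicast targets.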

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TileLoader, TrivialRegisterPassable

`comptime` members

`__copy_ctor_is_trivial`

comptime __copy_ctor_is_trivial = True

`__del__is_trivial`

comptime __del__is_trivial = True

`__move_ctor_is_trivial`

comptime __move_ctor_is_trivial = True

`TMATensorTilePtr`

comptime TMATensorTilePtr = Pointer[TMATensorTile[dtype, tile_layout, desc_layout], tma_origin]

Methods

`__init__`

__init__(tma_op: Pointer[TMATensorTile[dtype, tile_layout, desc_layout], tma_origin], rank: Scalar[DType.uint], multicast_mask: UInt16) -> Self

Initialize the TMA tile loader.

Args:

tma_op (Pointer): Pointer to the TMA tensor descriptor.
rank (Scalar): Rank of this block within the cluster.
multicast_mask (UInt16): Bit mask for multicast targets.
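
As a rough illustration of what a multicast bit mask could look like, the following plain-Python sketch builds a 16-bit mask from a list of target ranks. It assumes that bit `i` selects the block with cluster rank `i`, which is the convention for the multicast mask in NVIDIA's PTX TMA instructions. The reference itself only says "bit mask for multicast targets", and the helper name `multicast_mask_for` is hypothetical.

```python
def multicast_mask_for(target_ranks, cluster_size):
    """Build a 16-bit multicast mask with one bit per target block.

    Assumption (not from the reference): bit i corresponds to the
    block whose rank within the cluster is i.
    """
    if not 1 <= cluster_size <= 16:
        raise ValueError("a UInt16 mask can address at most 16 blocks")
    mask = 0
    for rank in target_ranks:
        if not 0 <= rank < cluster_size:
            raise ValueError(f"rank {rank} is outside the cluster")
        mask |= 1 << rank
    return mask


# Broadcast to every block of a 4-block cluster.
all_blocks = multicast_mask_for(range(4), cluster_size=4)
print(bin(all_blocks))

# Target only blocks 0 and 2.
some_blocks = multicast_mask_for([0, 2], cluster_size=4)
print(bin(some_blocks))
```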

`load_tile`

load_tile(self, dst: LayoutTensor[dtype, dst.layout, MutAnyOrigin, address_space=AddressSpace.SHARED, element_layout=dst.element_layout, layout_int_type=dst.layout_int_type, linear_idx_type=dst.linear_idx_type, masked=dst.masked, alignment=128], mem_barrier: LegacyUnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED], _coords: Tuple[UInt, UInt])

Load a tile using TMA hardware acceleration.

Converts tile indices to element coordinates and initiates a TMA transfer. For clusters, uses multicast to share data across blocks.

Note: Coordinates are converted from (row, col) tile indices to (k_elements, row/col_elements) for TMA's K-major ordering.

Args:

dst (LayoutTensor): Destination tile in shared memory.
mem_barrier (LegacyUnsafePointer): Shared-memory barrier that tracks completion of the TMA transfer.
_coords (Tuple): Tile coordinates (row_tile_idx, col_tile_idx).
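
The coordinate conversion described in the note can be modeled in a few lines. This is a plain-Python sketch of the arithmetic only, not the Mojo implementation. It assumes that the column tile index walks the K dimension in steps of `BK` and that the row tile index is scaled by the tile's row extent. The names `tile_to_element_coords` and `tile_rows` are hypothetical.

```python
def tile_to_element_coords(row_tile_idx, col_tile_idx, tile_rows, bk):
    """Convert (row, col) tile indices to (k_elements, row_elements).

    Assumptions (not from the reference): the column tile index
    advances along K in steps of bk, and the row tile index advances
    in steps of tile_rows. The K offset comes first to match the
    K-major ordering mentioned in the note.
    """
    k_elements = col_tile_idx * bk
    row_elements = row_tile_idx * tile_rows
    return (k_elements, row_elements)


# Tile (row=3, col=2) with 128-row tiles and BK=64.
coords = tile_to_element_coords(3, 2, tile_rows=128, bk=64)
print(coords)  # (128, 384)
```

Note that the output order is swapped relative to the input order: the second input index produces the first output coordinate.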
