
Mojo struct

RegTileLoader

struct RegTileLoader[dtype: DType, thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], num_threads: Int = thread_layout.size(), warp_scope: Bool = False]

AMD buffer-resource load from DRAM to registers.

Pre-builds the AMDBufferResource from a DRAM TileTensor once. Each load() call distributes a source tile across threads and issues buffer_load intrinsics to fill a LOCAL register TileTensor.

The dst register tile uses row-major element ordering: each thread's (M, N) fragment is stored with strides (N, 1), so row i of dst is the contiguous fragment for m_mma = i. RegTileWriterLDS.copy reads in the same row-major order; the two are paired and must agree.
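The row-major ordering described above can be illustrated with a small plain-Python model (not Mojo, and not the library's implementation). The helper name fragment_offset and the fragment shape (4, 8) are hypothetical and chosen only for illustration.

```python
def fragment_offset(i, j, n_cols):
    """Linear offset of element (i, j) in an (M, N) fragment with strides (N, 1)."""
    return i * n_cols + j


# Hypothetical per-thread fragment shape, for illustration only.
M, N = 4, 8

# Row 2 of the fragment occupies one contiguous run of N offsets.
row_2 = [fragment_offset(2, j, N) for j in range(N)]
print(row_2)
```

Because each row maps to one contiguous run, a writer that walks the fragment row by row sees exactly the elements the loader stored for that row, which is why the loader and RegTileWriterLDS.copy have to agree on the order.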

Parameters

dtype (DType): Element data type.
thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Thread distribution layout (e.g. row_major[r, c] or col_major[r, c]).
num_threads (Int): Total threads in the block. When the block has more threads than thread_layout.size(), extra threads are idled. Only needed when the block size differs from the layout size (e.g. attention uses a warp-sized layout within a larger block). Defaults to thread_layout.size().
warp_scope (Bool): If True, uses lane_id() as worker index (warp scope). If False, uses thread_idx.x (block scope).
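As a rough plain-Python model of how num_threads and warp_scope interact, the sketch below picks the worker index and decides whether a thread takes part in the load. It assumes a 64-lane wavefront and assumes that a thread is idle when its worker index is not covered by the layout; both are illustrative assumptions, and WARP_SIZE, worker_index, and is_active are hypothetical names.

```python
WARP_SIZE = 64  # assumption: 64-lane wavefront; some AMD targets use 32


def worker_index(thread_idx_x, warp_scope):
    """Lane index within the warp (warp scope) or thread index in the block (block scope)."""
    return thread_idx_x % WARP_SIZE if warp_scope else thread_idx_x


def is_active(thread_idx_x, layout_size, warp_scope):
    """Assumed rule: a thread participates only if the layout covers its worker index."""
    return worker_index(thread_idx_x, warp_scope) < layout_size


# Thread 70 in a 256-thread block with a warp-sized (64-thread) layout:
print(worker_index(70, warp_scope=True))   # lane within its warp
print(is_active(70, 64, warp_scope=False)) # block scope: beyond the layout
print(is_active(70, 64, warp_scope=True))  # warp scope: every lane is covered
```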

Fields

bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM loads.
base_ptr_as_int (Int): Integer address of the DRAM tile base pointer. It is captured at construction so that the per-thread base offset in _buffer_load_impl is computed relative to the buffer resource's base rather than relative to src.ptr. The src passed to load() may be a sub-tile of gmem_tile (for example, matmul iterates a_blockrow.tile[BK, BM](k, 0) over k), so the offset between the two pointers must be folded into the per-thread vector_offset for buffer_load to address the correct rows.
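The offset folding described for base_ptr_as_int can be sketched in plain Python as follows. This is an illustrative model, not the body of _buffer_load_impl; the function name, the addresses, and the 2-byte element size are hypothetical.

```python
def offset_from_resource_base(base_ptr_as_int, src_ptr_as_int, elem_bytes, thread_offset):
    """Element offset relative to the buffer resource base for one thread.

    The byte distance between the sub-tile pointer and the resource base is
    converted to elements and added to the thread's offset inside the sub-tile.
    """
    byte_delta = src_ptr_as_int - base_ptr_as_int
    if byte_delta % elem_bytes != 0:
        raise ValueError("sub-tile pointer is not element-aligned to the base")
    return byte_delta // elem_bytes + thread_offset


# Hypothetical values: 2-byte elements, sub-tile starting 64 elements past the base.
base = 0x1000
src = base + 64 * 2
print(offset_from_resource_base(base, src, elem_bytes=2, thread_offset=5))
```

Without the byte_delta term, every sub-tile would read from the start of the block-row tile, since the buffer resource always addresses memory relative to the base it was built from.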

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods

`__init__`

def __init__(gmem_tile: TileTensor[dtype, Storage=gmem_tile.Storage, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type]) -> Self

Creates a loader from a DRAM tile.

The TileTensor may carry Scalar for any masked dimension (e.g. valid_rows in MixedLayout) so that make_amd_buffer_resource computes correct OOB clamping bounds.

Args:

gmem_tile (TileTensor[dtype, Storage=gmem_tile.Storage, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type]): The DRAM tile as TileTensor.

def __init__(gmem_tile: TileTensor[dtype, Storage=gmem_tile.Storage, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type], *, bounds_from: TileTensor[dtype, Storage=bounds_from.Storage, address_space=bounds_from.address_space, linear_idx_type=bounds_from.linear_idx_type]) -> Self

Creates a loader with OOB bounds from a full (pre-tiled) tensor.

TileTensor.tile produces compile-time shapes that are never clipped to the actual tensor extent. This overload derives the buffer resource clamping range from bounds_from (which carries runtime dimensions), so OOB loads return zero for partial edge blocks.

Args:

gmem_tile (TileTensor[dtype, Storage=gmem_tile.Storage, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type]): Block-row tile (provides base pointer for loads).
bounds_from (TileTensor[dtype, Storage=bounds_from.Storage, address_space=bounds_from.address_space, linear_idx_type=bounds_from.linear_idx_type]): Full tensor with runtime dims for OOB bounds.
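To see why the clamping range matters for partial edge blocks, here is a simplified plain-Python model of a bounds-checked load. It assumes an element-granular bound named num_records and zero fill past that bound; real buffer resources are more detailed, and buffer_load_model and the sample data are hypothetical.

```python
def buffer_load_model(memory, num_records, offset, width):
    """Read `width` elements starting at `offset`; elements at or past num_records read as 0."""
    return [memory[k] if k < num_records else 0 for k in range(offset, offset + width)]


# 10 valid elements followed by 2 unrelated neighbouring values (99).
backing = list(range(1, 11)) + [99, 99]

# Bound taken from a compile-time tile shape that overshoots the real extent:
from_tile_shape = buffer_load_model(backing, num_records=12, offset=8, width=4)

# Bound taken from the runtime extent of the full tensor:
from_runtime_dims = buffer_load_model(backing, num_records=10, offset=8, width=4)

print(from_tile_shape)
print(from_runtime_dims)
```

In this model the first load picks up neighbouring data past the edge, while the second returns zeros there, which is the behaviour the bounds_from overload is described as providing.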

`load`

def load(self, dst: TileTensor[dtype, Storage=dst.Storage, address_space=AddressSpace.LOCAL, linear_idx_type=dst.linear_idx_type], src: TileTensor[dtype, Storage=src.Storage, address_space=src.address_space, linear_idx_type=src.linear_idx_type])

Loads DRAM tile data into a LOCAL register tile.

Distributes src across threads, reads row-major from DRAM, and stores row-major into dst (the same order that RegTileWriterLDS.copy reads).

Args:

dst (TileTensor[dtype, Storage=dst.Storage, address_space=AddressSpace.LOCAL, linear_idx_type=dst.linear_idx_type]): Destination register TileTensor (LOCAL address space).
src (TileTensor[dtype, Storage=src.Storage, address_space=src.address_space, linear_idx_type=src.linear_idx_type]): Source DRAM TileTensor (vectorized).
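One way to picture the "distribute across threads" step is the plain-Python model below. It assumes a row_major thread layout and a strided partition in which neighbouring threads own neighbouring elements; this is an assumption about the distribution scheme, not a statement of what load() does internally, and all names and sizes are hypothetical.

```python
def thread_fragment_coords(worker, thread_rows, thread_cols, tile_rows, tile_cols):
    """Source coordinates owned by one worker, arranged as its row-major fragment."""
    tr, tc = divmod(worker, thread_cols)  # row_major thread layout
    frag_rows = tile_rows // thread_rows
    frag_cols = tile_cols // thread_cols
    return [
        [(i * thread_rows + tr, j * thread_cols + tc) for j in range(frag_cols)]
        for i in range(frag_rows)
    ]


# A 4x4 tile split over a 2x2 thread layout: each worker gets a 2x2 fragment.
frag_3 = thread_fragment_coords(3, 2, 2, 4, 4)
print(frag_3)

# Every tile element is owned by exactly one worker.
owned = sorted(
    coord
    for w in range(4)
    for row in thread_fragment_coords(w, 2, 2, 4, 4)
    for coord in row
)
print(len(owned), len(set(owned)))
```

Each inner list corresponds to one row of the per-thread dst fragment, so storing the rows in order gives the row-major layout that the paired writer expects.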
