
Mojo struct

RegTileLoader

struct RegTileLoader[dtype: DType, thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], num_threads: Int = thread_layout.size(), warp_scope: Bool = False]

AMD buffer-resource load from DRAM to registers.

Pre-builds the AMDBufferResource from a DRAM TileTensor once. Each load() call distributes a source tile across threads and issues buffer_load intrinsics to fill a LOCAL register TileTensor.

The dst register tile uses row-major element ordering: each thread's (M, N) fragment is stored with strides (N, 1), so row i of dst holds the contiguous fragment for m_mma = i. RegTileWriterLDS.copy reads in the same row-major order. The two are paired, and their orderings must agree.
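
The row-major fragment ordering described above can be illustrated with a small plain-Python model. This is only a sketch of the index arithmetic, not the Mojo API; the function name `fragment_offset` and the fragment shape are hypothetical.

```python
def fragment_offset(i: int, j: int, n_cols: int) -> int:
    """Linear offset of element (i, j) in an (M, N) fragment with strides (N, 1)."""
    return i * n_cols + j


# Hypothetical per-thread fragment shape, chosen only for illustration.
M, N = 2, 4

# Each inner list is one fragment row; its offsets are consecutive,
# which is what "row i is a contiguous fragment" means here.
offsets = [[fragment_offset(i, j, N) for j in range(N)] for i in range(M)]
print(offsets)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A writer that walks the same offsets in the same order sees the elements exactly as the loader stored them, which is why the two orderings have to match.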

Parameters​

  • ​dtype (DType): Element data type.
  • thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Thread distribution layout (e.g. row_major[r, c] or col_major[r, c]).
  • ​num_threads (Int): Total threads in the block. When the block has more threads than thread_layout.size(), extra threads are idled. Only needed when the block size differs from the layout size (e.g. attention uses a warp-sized layout within a larger block). Defaults to thread_layout.size().
  • ​warp_scope (Bool): If True, uses lane_id() as worker index (warp scope). If False, uses thread_idx.x (block scope).
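
The interaction between warp_scope, num_threads, and the layout size can be sketched in plain Python. This is a model under stated assumptions, not the library's implementation: it assumes a 64-lane warp, models lane_id() as the thread index modulo the warp size, and assumes a thread is idle when its worker index is not below thread_layout.size(). The names `worker_index` and `participates` are hypothetical.

```python
WARP_SIZE = 64  # assumption: 64-lane wavefront


def worker_index(thread_idx_x: int, warp_scope: bool) -> int:
    """Worker index: lane within the warp (warp scope) or thread within the block."""
    return thread_idx_x % WARP_SIZE if warp_scope else thread_idx_x


def participates(thread_idx_x: int, layout_size: int, warp_scope: bool) -> bool:
    """Assumed idling rule: only workers that fit in the thread layout take part."""
    return worker_index(thread_idx_x, warp_scope) < layout_size


# A 256-thread block with a warp-sized (64-thread) layout.
block_scope_active = sum(participates(t, 64, False) for t in range(256))
warp_scope_active = sum(participates(t, 64, True) for t in range(256))
print(block_scope_active, warp_scope_active)  # 64 256
```

Under these assumptions, block scope leaves threads 64 through 255 idle, while warp scope lets every warp index the layout by lane.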

Fields​

  • ​bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM loads.
  • base_ptr_as_int (Int): Integer address of the DRAM tile base pointer. Captured at construction so the per-thread base offset in _buffer_load_impl is computed relative to the buffer resource's base, not relative to src.ptr. src passed to load() may be a sub-tile of gmem_tile (matmul iterates a_blockrow.tile[BK, BM](k, 0) over k); the offset between the two pointers must fold into the per-thread vector_offset for buffer_load to address the correct rows.
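
The offset folding described for base_ptr_as_int can be sketched as simple address arithmetic. This is a plain-Python illustration, not the body of _buffer_load_impl; the function signature, the element size, and all numeric values are hypothetical.

```python
def vector_offset(base_ptr_as_int: int, src_ptr_as_int: int,
                  elem_bytes: int, thread_elem_offset: int) -> int:
    """Element offset from the buffer resource base for one thread's load.

    The sub-tile's distance from the base is folded into the per-thread offset,
    so the load is addressed relative to the resource base, not to src.ptr.
    """
    byte_delta = src_ptr_as_int - base_ptr_as_int
    assert byte_delta % elem_bytes == 0, "sub-tile must be element-aligned"
    return byte_delta // elem_bytes + thread_elem_offset


# Hypothetical numbers: 2-byte elements, 128 elements per row, sub-tile k = 1
# starting BK = 32 rows below the base tile.
base = 4096
elem_bytes, row_stride, BK, k = 2, 128, 32, 1
src = base + k * BK * row_stride * elem_bytes

print(vector_offset(base, src, elem_bytes, thread_elem_offset=5))  # 4101
```

Dropping the `byte_delta` term would make every sub-tile read the rows of the first one, which is the failure the field description warns about.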

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

Methods​

__init__​

__init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]) -> Self

Creates a loader from a DRAM tile.

The TileTensor may carry Scalar for any masked dimension (e.g. valid_rows in MixedLayout) so that make_amd_buffer_resource computes correct OOB clamping bounds.

Args:

__init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size], *, bounds_from: TileTensor[dtype, address_space=bounds_from.address_space, linear_idx_type=bounds_from.linear_idx_type, element_size=bounds_from.element_size]) -> Self

Creates a loader with OOB bounds from a full (pre-tiled) tensor.

TileTensor.tile produces compile-time shapes that are never clipped to the actual tensor extent. This overload derives the buffer resource clamping range from bounds_from (which carries runtime dimensions), so OOB loads return zero for partial edge blocks.
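
The effect of deriving the clamping range from runtime dimensions can be modeled in plain Python. This is a sketch under assumptions: bounds are counted in elements rather than bytes, the tile spans full rows, and `buffer_load` and `load_tile_rows` are hypothetical stand-ins, not the real intrinsic or loader.

```python
def buffer_load(memory, num_records: int, offset: int):
    """Model of a bounds-checked load: offsets outside the range return zero."""
    return memory[offset] if 0 <= offset < num_records else 0


def load_tile_rows(memory, row_stride, tile_rows, tile_cols, first_row, valid_rows):
    """Load a compile-time (tile_rows, tile_cols) tile, clamped to valid_rows."""
    num_records = valid_rows * row_stride  # range taken from the runtime extent
    return [
        [buffer_load(memory, num_records, (first_row + r) * row_stride + c)
         for c in range(tile_cols)]
        for r in range(tile_rows)
    ]


# A 5 x 4 tensor tiled in blocks of 4 rows: the second block has 1 valid row.
memory = list(range(1, 21))
edge = load_tile_rows(memory, row_stride=4, tile_rows=4, tile_cols=4,
                      first_row=4, valid_rows=5)
print(edge)  # [[17, 18, 19, 20], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

The tile shape stays fixed at 4 x 4 in both blocks; only the clamping range, taken from the full tensor, turns the rows past the end into zeros.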

Args:

load​

load(self, dst: TileTensor[dtype, address_space=AddressSpace.LOCAL, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size])

Loads DRAM tile data into a LOCAL register tile.

Distributes src across threads, reads row-major from DRAM, stores row-major into dst (matched by RegTileWriterLDS.copy).
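
How a tile might be distributed across a thread layout can be sketched in plain Python. The interleaved ownership pattern below (thread (tr, tc) owns every element whose row is tr modulo the layout rows and whose column is tc modulo the layout columns) is an assumption made for illustration; the page does not state the exact pattern. `thread_coords` and `owned_elements` are hypothetical names.

```python
def thread_coords(t: int, rows: int, cols: int, row_major: bool = True):
    """Map a linear worker index to (row, col) in an (rows, cols) thread layout."""
    return (t // cols, t % cols) if row_major else (t % rows, t // rows)


def owned_elements(t, tile_shape, thread_shape, row_major=True):
    """Tile coordinates owned by worker t, listed in row-major order."""
    tile_m, tile_n = tile_shape
    rows, cols = thread_shape
    tr, tc = thread_coords(t, rows, cols, row_major)
    return [(i, j) for i in range(tr, tile_m, rows) for j in range(tc, tile_n, cols)]


# A 4 x 4 source tile over a 2 x 2 row-major thread layout.
print(owned_elements(1, (4, 4), (2, 2)))  # [(0, 1), (0, 3), (2, 1), (2, 3)]
```

Under this assumed pattern, each thread ends up with a 2 x 2 fragment, and listing it row by row gives the order in which it would be written to the register tile.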

Args: