Mojo struct
RegTileLoader
struct RegTileLoader[dtype: DType, thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], num_threads: Int = thread_layout.size(), warp_scope: Bool = False]
AMD buffer-resource load from DRAM to registers.
Pre-builds the AMDBufferResource from a DRAM TileTensor once. Each load() call distributes a source tile across threads and issues buffer_load intrinsics to fill a LOCAL register TileTensor.
The dst register tile uses row-major element ordering: each thread's
(M, N) fragment is stored with strides (N, 1), so row i of dst is the
contiguous fragment for m_mma = i. RegTileWriterLDS.copy reads in the
same row-major order; the two are paired and must agree.
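The row-major ordering can be illustrated with a minimal plain-Python sketch. It is not Mojo and does not call the library; the helper name fragment_offset and the fragment shape (2, 4) are hypothetical and chosen only to show how strides (N, 1) map a fragment coordinate to a register slot.

```python
def fragment_offset(i: int, j: int, n: int) -> int:
    """Linear slot of element (i, j) in an (M, N) fragment with strides (N, 1)."""
    return i * n + j


# Hypothetical per-thread fragment shape, for illustration only.
M, N = 2, 4

# Row i occupies the contiguous slot range [i * N, (i + 1) * N).
offsets = [[fragment_offset(i, j, N) for j in range(N)] for i in range(M)]
print(offsets)
```

Because each row is contiguous, a reader that walks the fragment in the same row-major order sees row i as a single unbroken run, which is the agreement the paired writer depends on.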
Parameters
- dtype (DType): Element data type.
- thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Thread distribution layout (e.g. row_major[r, c] or col_major[r, c]).
- num_threads (Int): Total threads in the block. When the block has more threads than thread_layout.size(), extra threads are idled. Only needed when the block size differs from the layout size (e.g. attention uses a warp-sized layout within a larger block). Defaults to thread_layout.size().
- warp_scope (Bool): If True, uses lane_id() as the worker index (warp scope). If False, uses thread_idx.x (block scope).
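One plausible reading of the worker-index selection is sketched below in plain Python. This is a model, not the library's implementation: the function name worker_index, the warp size of 64, and the use of None to mark an idle thread are all assumptions made for illustration.

```python
from typing import Optional


def worker_index(
    thread_idx_x: int,
    layout_size: int,
    warp_scope: bool,
    warp_size: int = 64,  # assumed wavefront size; not stated in the source
) -> Optional[int]:
    """Return the index used to pick a thread's slice, or None if it idles."""
    # Warp scope: the lane within the warp. Block scope: the thread within the block.
    idx = thread_idx_x % warp_size if warp_scope else thread_idx_x
    return idx if idx < layout_size else None


threads = (0, 63, 64, 127)  # sample threads from a hypothetical 128-thread block
block = [worker_index(t, layout_size=64, warp_scope=False) for t in threads]
warp = [worker_index(t, layout_size=64, warp_scope=True) for t in threads]
print(block)
print(warp)
```

In this model, block scope leaves threads 64 and above idle when the layout covers only 64 workers, while warp scope lets every warp reuse the same warp-sized layout.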
Fields
- bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM loads.
- base_ptr_as_int (Int): Integer address of the DRAM tile base pointer. Captured at construction so that the per-thread base offset in _buffer_load_impl is computed relative to the buffer resource's base, not relative to src.ptr. The src passed to load() may be a sub-tile of gmem_tile (matmul iterates a_blockrow.tile[BK, BM](k, 0) over k); the offset between the two pointers must fold into the per-thread vector_offset for buffer_load to address the correct rows.
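The offset folding described for base_ptr_as_int can be sketched in plain Python. The function below is a hypothetical model, not the body of _buffer_load_impl: the addresses, the 2-byte element size, and the choice of element (rather than byte) units are assumptions for illustration.

```python
def vector_offset(
    base_ptr: int, src_ptr: int, thread_elem_offset: int, elem_bytes: int
) -> int:
    """Element offset of one thread's load, measured from the buffer resource base."""
    delta_bytes = src_ptr - base_ptr
    if delta_bytes % elem_bytes != 0:
        raise ValueError("sub-tile pointer is not element-aligned")
    # Fold the sub-tile displacement into the per-thread offset.
    return delta_bytes // elem_bytes + thread_elem_offset


ELEM_BYTES = 2   # assumed 16-bit element type
BASE = 4096      # hypothetical address captured at construction
BK = 32          # hypothetical tile width in elements

# Sub-tile k starts BK * k elements past the base in this simplified model.
offset_k0 = vector_offset(BASE, BASE + 0 * BK * ELEM_BYTES, 5, ELEM_BYTES)
offset_k1 = vector_offset(BASE, BASE + 1 * BK * ELEM_BYTES, 5, ELEM_BYTES)
print(offset_k0, offset_k1)
```

Dropping the folded term would make every sub-tile address the same elements as sub-tile 0, since the buffer resource itself always starts at the base captured at construction.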
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TrivialRegisterPassable
Methods
__init__
__init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]) -> Self
Creates a loader from a DRAM tile.
The TileTensor may carry Scalar for any masked dimension (e.g. valid_rows in MixedLayout) so that make_amd_buffer_resource computes correct OOB clamping bounds.
Args:
- gmem_tile (TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]): The DRAM tile as a TileTensor.
__init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size], *, bounds_from: TileTensor[dtype, address_space=bounds_from.address_space, linear_idx_type=bounds_from.linear_idx_type, element_size=bounds_from.element_size]) -> Self
Creates a loader with OOB bounds from a full (pre-tiled) tensor.
TileTensor.tile produces compile-time shapes that are never clipped to the actual tensor extent. This overload derives the buffer resource clamping range from bounds_from (which carries runtime dimensions), so OOB loads return zero for partial edge blocks.
Args:
- gmem_tile (TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]): Block-row tile (provides the base pointer for loads).
- bounds_from (TileTensor[dtype, address_space=bounds_from.address_space, linear_idx_type=bounds_from.linear_idx_type, element_size=bounds_from.element_size]): Full tensor with runtime dims for OOB bounds.
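The effect of deriving the clamping range from the full tensor can be modeled in plain Python. This sketch does not use the AMD buffer resource or any Mojo API; buffer_load and load_tile are hypothetical helpers, and the flat list stands in for DRAM that continues past the valid rows.

```python
def buffer_load(memory: list, offset: int, num_records: int) -> int:
    """Model of a clamped load: out-of-range offsets return zero."""
    return memory[offset] if 0 <= offset < num_records else 0


def load_tile(memory, cols, first_row, tile_rows, valid_rows):
    """Load a tile whose compile-time height may exceed the valid rows."""
    # The bound comes from the runtime extent, not the tile shape.
    num_records = valid_rows * cols
    return [
        [buffer_load(memory, (first_row + r) * cols + c, num_records)
         for c in range(cols)]
        for r in range(tile_rows)
    ]


# 32 nonzero values: everything past row 4 is unrelated data, not padding.
memory = list(range(1, 33))

# Tensor has 5 valid rows of 4 columns; the second 4-row tile starts at row 4.
edge = load_tile(memory, cols=4, first_row=4, tile_rows=4, valid_rows=5)
print(edge)
```

Only the first row of the edge tile lies inside the tensor, so the remaining rows come back as zeros instead of whatever follows the tensor in memory.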
load
load(self, dst: TileTensor[dtype, address_space=AddressSpace.LOCAL, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size])
Loads DRAM tile data into a LOCAL register tile.
Distributes src across threads, reads row-major from DRAM,
stores row-major into dst (matched by RegTileWriterLDS.copy).
Args:
- dst (TileTensor[dtype, address_space=AddressSpace.LOCAL, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size]): Destination register TileTensor (LOCAL address space).
- src (TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size]): Source DRAM TileTensor (vectorized).
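How a source tile might be split across threads and packed row-major into each thread's registers is sketched below in plain Python. The interleaved ownership rule, in which thread (ti, tj) owns elements (ti + m*r, tj + n*c), is an assumption about the distribution and is not stated on this page; thread_fragment is a hypothetical helper, and each element here could stand for one vector of the vectorized source.

```python
def thread_fragment(tile, thread_id: int, r: int, c: int):
    """Fragment owned by one thread under an assumed row-major (r, c) thread layout."""
    rows, cols = len(tile), len(tile[0])
    ti, tj = divmod(thread_id, c)  # row-major thread coordinates
    return [
        [tile[ti + m * r][tj + n * c] for n in range(cols // c)]
        for m in range(rows // r)
    ]


# A 4x4 source tile whose values equal their row-major position.
tile = [[row * 4 + col for col in range(4)] for row in range(4)]

frag = thread_fragment(tile, thread_id=1, r=2, c=2)
registers = [value for row in frag for value in row]  # row-major store into dst
print(frag)
print(registers)
```

Under this assumption, thread 1 reads elements from rows 0 and 2 and columns 1 and 3, then stores them row by row, so its register order matches the row-major convention that the paired writer expects.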