
Mojo struct

RegTileWriter

struct RegTileWriter[dtype: DType, thread_rows: Int, thread_cols: Int]

AMD buffer-resource store for writing register tiles to DRAM.

Pre-builds the AMDBufferResource from the full DRAM output tile once. Each store() call writes a warp sub-tile's worth of register data to DRAM via the pre-built descriptor; out-of-bounds (OOB) lanes, meaning those past the recorded byte bound, are silently dropped by the hardware clamp.

Pure TileTensor implementation: it uses TileTensor's distribute_with_offset directly, with no LayoutTensor conversion. The distribute operation divides shape by thread_shape and multiplies strides by thread_shape, producing offsets identical to those of LayoutTensor's zipped_divide for flat 2D layouts.
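The shape and stride rule above can be checked with a small host-side Python model. This is a sketch, not the Mojo implementation: the helper names `distribute`, `thread_offset`, and `thread_elements`, the 4×8 row-major tile, and the 2×4 thread shape are all hypothetical. The per-thread base offset (thread coordinate times the original stride) is likewise an assumption about how the offset part of distribute_with_offset behaves.

```python
def distribute(shape, strides, thread_shape):
    """Per-thread fragment layout: shape / thread_shape, strides * thread_shape."""
    frag_shape = tuple(s // t for s, t in zip(shape, thread_shape))
    frag_strides = tuple(d * t for d, t in zip(strides, thread_shape))
    return frag_shape, frag_strides


def thread_offset(tid, strides, thread_shape):
    """Base offset of thread `tid`, assuming a col-major thread layout."""
    thread_rows, _ = thread_shape
    # Col-major: the row coordinate varies fastest with the thread id.
    row, col = tid % thread_rows, tid // thread_rows
    return row * strides[0] + col * strides[1]


def thread_elements(tid, shape, strides, thread_shape):
    """Flat offsets of every element owned by thread `tid`."""
    (f0, f1), (s0, s1) = distribute(shape, strides, thread_shape)
    base = thread_offset(tid, strides, thread_shape)
    return [base + i * s0 + j * s1 for i in range(f0) for j in range(f1)]


# Hypothetical 4x8 row-major tile split across a 2x4 thread grid.
shape, strides, thread_shape = (4, 8), (8, 1), (2, 4)
frag_shape, frag_strides = distribute(shape, strides, thread_shape)

# Every element of the tile should be owned by exactly one thread.
covered = sorted(
    off
    for tid in range(thread_shape[0] * thread_shape[1])
    for off in thread_elements(tid, shape, strides, thread_shape)
)
```

In this model each thread owns a 2×2 fragment with strides (16, 4), and the eight threads together cover all 32 offsets of the tile exactly once.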

A single store[mfma32: Bool = False] method handles both:

mfma32=False: Generic path using the src tile's own layout indexing (any MMA shape).
mfma32=True: 32×32 MFMA path with hardware-specific register permutation (src[4*n + 16*m] → fragment position 4*m + n).

See RegTileWriterLDS.copy for the matching row-major register reader used in DRAM→reg→SMEM pipelines.

The buffer-resource OOB clamp bounds the store by the destination tensor's total byte extent, not by a per-row column extent. A SIMD chunk that straddles an N boundary (the last column block when N % BN != 0) therefore spills into the next row of the same buffer instead of being clipped. For kernels that need to support N-misaligned shapes (or a fused lambda), use RegTileEpilogue instead.
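The difference between a whole-buffer clamp and a per-row clip can be seen in a small Python model. This is a sketch under stated assumptions, not the hardware behavior or the Mojo API: `buffer_store` is a hypothetical helper, the model counts elements rather than bytes, and the 2×6 buffer with a 4-wide chunk is made up for illustration.

```python
def buffer_store(buf, offset, values):
    """Model of a clamped store: a lane is dropped only if its flat
    offset is at or past the end of the whole buffer."""
    for lane, v in enumerate(values):
        idx = offset + lane
        if idx < len(buf):  # clamp against the total extent, not the row end
            buf[idx] = v


M, N, WIDTH = 2, 6, 4      # N % WIDTH != 0, so the last chunk straddles N
buf = [0] * (M * N)        # row-major M x N destination, flattened

# Row 0, columns 4..7: columns 6 and 7 do not exist in row 0.
buffer_store(buf, 0 * N + 4, [1, 2, 3, 4])
row0, row1 = buf[:N], buf[N:]

# Row 1, columns 4..7: here the chunk runs past the end of the buffer.
tail = [0] * (M * N)
buffer_store(tail, 1 * N + 4, [5, 6, 7, 8])
```

In the first store the two overhanging lanes land in columns 0 and 1 of row 1, which is the spill described above. Only in the second store, where the chunk passes the end of the whole buffer, are the overhanging lanes dropped.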

Parameters

dtype (DType): Element data type for DRAM destination.
thread_rows (Int): Number of rows in the col-major thread distribution.
thread_cols (Int): Number of columns in the col-major thread distribution.

Fields

bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM stores.
base_ptr_as_int (Int): Integer address of the full DRAM tile base pointer.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

`comptime` members

`thread_layout`

comptime thread_layout = col_major[thread_rows, thread_cols]()

Methods

`__init__`

def __init__(dst_base: TileTensor[Storage=dst_base.Storage, address_space=dst_base.address_space, linear_idx_type=dst_base.linear_idx_type]) -> Self

Create a writer from the full DRAM output tile.

The TileTensor must carry Scalar for any masked dimension (e.g. valid_rows) so that make_amd_buffer_resource computes correct OOB clamping bounds.

Args:

dst_base (TileTensor[Storage=dst_base.Storage, address_space=dst_base.address_space, linear_idx_type=dst_base.linear_idx_type]): The full DRAM output tile as TileTensor.

`store`

def store[mfma32: Bool = False](self, dst_warp_tile: TileTensor[dtype, Storage=dst_warp_tile.Storage, address_space=dst_warp_tile.address_space, linear_idx_type=dst_warp_tile.linear_idx_type], src_tile: TileTensor[Storage=src_tile.Storage, address_space=AddressSpace.LOCAL, linear_idx_type=src_tile.linear_idx_type])

Write register tile data to a DRAM warp sub-tile.

The distribute + base-offset prologue is identical across MMA shapes; only the (iteration index i, src scalar offset) pair differs:

mfma32=False: iterate i in range(dst_shape0 * dst_shape1), read src at (i // src_cols) * src_stride0 + (i % src_cols) * elem_size (source's natural row-major layout).
mfma32=True: iterate (m, n) over (src_shape0, src_shape1 / elem_size), read src at 4*n + 16*m (32×32 MFMA register permutation) and map to fragment position i = 4*m + n.
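The two (iteration index, src offset) enumerations above can be written out as a host-side Python sketch. It only reproduces the index arithmetic as stated; it is not the Mojo method. The function names `generic_pairs` and `mfma32_pairs` are hypothetical, `src_cols` and `src_stride0` are taken as plain inputs because their derivation is not spelled out here, and every shape in the example calls is a made-up value.

```python
def generic_pairs(num_frags, src_cols, src_stride0, elem_size):
    """mfma32=False: (i, src offset) for i in range(num_frags)."""
    return [
        (i, (i // src_cols) * src_stride0 + (i % src_cols) * elem_size)
        for i in range(num_frags)
    ]


def mfma32_pairs(src_shape0, src_shape1, elem_size):
    """mfma32=True: (fragment position 4*m + n, src offset 4*n + 16*m)."""
    return [
        (4 * m + n, 4 * n + 16 * m)
        for m in range(src_shape0)
        for n in range(src_shape1 // elem_size)
    ]


# Illustrative values only: 4 fragments, 2 vector columns, row stride 8,
# vector width 4 for the generic path.
generic = generic_pairs(num_frags=4, src_cols=2, src_stride0=8, elem_size=4)

# Illustrative values only: a 2 x 16 register tile with vector width 4.
permuted = mfma32_pairs(src_shape0=2, src_shape1=16, elem_size=4)
```

Each returned pair gives the destination fragment index first and the scalar offset into the register tile second, so the lists can be compared directly against a trace of a real kernel.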

Parameters:

mfma32 (Bool): Select the 32×32 MFMA register permutation instead of the src tile's natural layout.

Args:

dst_warp_tile (TileTensor[dtype, Storage=dst_warp_tile.Storage, address_space=dst_warp_tile.address_space, linear_idx_type=dst_warp_tile.linear_idx_type]): Vectorized DRAM warp sub-tile.
src_tile (TileTensor[Storage=src_tile.Storage, address_space=AddressSpace.LOCAL, linear_idx_type=src_tile.linear_idx_type]): Register TileTensor with MMA output data.
