IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo struct

RegTileWriter

struct RegTileWriter[dtype: DType, thread_rows: Int, thread_cols: Int]

AMD buffer-resource store for writing register tiles to DRAM.

Pre-builds the AMDBufferResource from the full DRAM output tile once. Each store() call writes a warp sub-tile's worth of register data to DRAM via the pre-built descriptor; out-of-bounds (OOB) lanes, meaning those past the recorded byte bound, are silently dropped by the hardware clamp.

Pure TileTensor implementation: uses TileTensor distribute_with_offset directly (no LayoutTensor conversion). The distribute operation divides shape by thread_shape and multiplies strides by thread_shape, producing offsets identical to LayoutTensor's zipped_divide for flat 2D layouts.
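The shape and stride rule described above can be checked with a small plain-Python model. This is a sketch of the arithmetic only, not the Mojo API: `distribute` and `thread_offsets` are hypothetical helpers, the tile is assumed to be a flat 2D layout, and the thread-to-coordinate mapping assumes a column-major thread grid (thread `t` sits at row `t % thread_rows`, column `t // thread_rows`).

```python
def distribute(shape, strides, thread_shape, tid):
    """Hypothetical model: per-thread fragment of a flat 2D layout.

    Returns (fragment_shape, fragment_strides, base_offset) for thread `tid`,
    assuming a column-major thread grid of shape `thread_shape`.
    """
    rows, cols = shape
    stride_r, stride_c = strides
    t_rows, t_cols = thread_shape
    assert rows % t_rows == 0 and cols % t_cols == 0

    # Shape is divided by the thread shape; strides are multiplied by it.
    frag_shape = (rows // t_rows, cols // t_cols)
    frag_strides = (stride_r * t_rows, stride_c * t_cols)

    # Column-major thread index -> (row, col) coordinate in the thread grid.
    t_r, t_c = tid % t_rows, tid // t_rows
    base = t_r * stride_r + t_c * stride_c
    return frag_shape, frag_strides, base


def thread_offsets(shape, strides, thread_shape, tid):
    """Linear element offsets touched by one thread."""
    (fr, fc), (sr, sc), base = distribute(shape, strides, thread_shape, tid)
    return [base + i * sr + j * sc for i in range(fr) for j in range(fc)]


# 8x8 row-major tile split across a 4x2 column-major thread grid.
shape, strides, threads = (8, 8), (8, 1), (4, 2)
all_offsets = sorted(
    off
    for tid in range(threads[0] * threads[1])
    for off in thread_offsets(shape, strides, threads, tid)
)
```

In this model the eight threads together touch each of the 64 element offsets exactly once, which is the property a thread distribution needs to have.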

A single store[mfma32: Bool = False] method handles both paths:

  • mfma32=False: Generic path using the src tile's own layout indexing (any MMA shape).
  • mfma32=True: 32×32 MFMA path with hardware-specific register permutation (src[4*n + 16*m] → fragment position 4*m + n).

See RegTileWriterLDS.copy for the matching row-major register reader used in DRAM→reg→SMEM pipelines.

The buffer-resource OOB clamp bounds the store by the destination tensor's TOTAL byte extent, not by a per-row column extent. A SIMD chunk that straddles an N boundary (the last column block when N % BN != 0) therefore spills into the next row of the same buffer instead of being clipped. For kernels that must support N-misaligned shapes, use RegTileEpilogue (or a fused lambda) instead.
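The spill behaviour can be illustrated with a plain-Python model. This is a sketch under stated assumptions, not the hardware semantics in full: `clamped_store` is a hypothetical helper, the buffer is assumed to be row-major with N elements per row, and offsets are counted in elements rather than bytes.

```python
def clamped_store(m, n, row, col, values):
    """Model a store of len(values) contiguous elements starting at (row, col)
    into a row-major m x n buffer whose only bound is its total size.

    Returns the (row, col) positions actually written. Lanes whose linear
    offset is >= m * n are dropped, mimicking a clamp on the total extent.
    """
    total = m * n
    start = row * n + col
    written = []
    for lane in range(len(values)):
        off = start + lane
        if off < total:  # only the total extent is checked, never the row end
            written.append(divmod(off, n))
    return written


# N = 6 is not a multiple of the 4-wide chunk, so the last chunk in a row
# starts at column 4 and would cover columns 4..7.
M, N = 3, 6
spill = clamped_store(M, N, row=0, col=4, values=[1, 2, 3, 4])
last = clamped_store(M, N, row=2, col=4, values=[1, 2, 3, 4])
```

In row 0 the two overhanging lanes land at (1, 0) and (1, 1), overwriting the start of the next row. Only in the final row do they fall past the total extent and get dropped.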

Parameters

  • dtype (DType): Element data type for DRAM destination.
  • thread_rows (Int): Number of rows in the col-major thread distribution.
  • thread_cols (Int): Number of columns in the col-major thread distribution.

Fields

  • bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM stores.
  • base_ptr_as_int (Int): Integer address of the full DRAM tile base pointer.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

comptime members

thread_layout

comptime thread_layout = col_major[thread_rows, thread_cols]()

Methods

__init__

__init__(dst_base: TileTensor[address_space=dst_base.address_space, linear_idx_type=dst_base.linear_idx_type, element_size=dst_base.element_size]) -> Self

Create a writer from the full DRAM output tile.

The TileTensor must carry Scalar for any masked dimension (e.g. valid_rows) so that make_amd_buffer_resource computes correct OOB clamping bounds.

Args:

store

store[mfma32: Bool = False](self, dst_warp_tile: TileTensor[dtype, address_space=dst_warp_tile.address_space, linear_idx_type=dst_warp_tile.linear_idx_type, element_size=dst_warp_tile.element_size], src_tile: TileTensor[address_space=AddressSpace.LOCAL, linear_idx_type=src_tile.linear_idx_type, element_size=src_tile.element_size])

Write register tile data to a DRAM warp sub-tile.

The distribute + base-offset prologue is identical across MMA shapes; only the (iteration index i, src scalar offset) pair differs:

  • mfma32=False: iterate i in range(dst_shape0 * dst_shape1), read src at (i // src_cols) * src_stride0 + (i % src_cols) * elem_size (source's natural row-major layout).
  • mfma32=True: iterate (m, n) over (src_shape0, src_shape1 / elem_size), read src at 4*n + 16*m (32×32 MFMA register permutation) and map to fragment position i = 4*m + n.
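The two (iteration index, src scalar offset) enumerations listed above can be written out in plain Python. This is a sketch of the index formulas only, not the Mojo implementation: both function names are hypothetical, and the sizes used at the bottom are illustrative assumptions rather than values taken from a real kernel.

```python
def generic_src_offsets(num_frags, src_cols, src_stride0, elem_size):
    """mfma32=False: (i, src scalar offset) pairs from the generic formula."""
    return [
        (i, (i // src_cols) * src_stride0 + (i % src_cols) * elem_size)
        for i in range(num_frags)
    ]


def mfma32_src_offsets(src_shape0, src_shape1, elem_size):
    """mfma32=True: (i, src scalar offset) pairs for the 32x32 MFMA path."""
    pairs = []
    for m in range(src_shape0):
        for n in range(src_shape1 // elem_size):
            # Read src at 4*n + 16*m, write to fragment position 4*m + n.
            pairs.append((4 * m + n, 4 * n + 16 * m))
    return pairs


# Illustrative sizes (assumptions): 4-wide elements, 16 scalars per src row.
generic = generic_src_offsets(num_frags=8, src_cols=4, src_stride0=16, elem_size=4)
mfma = mfma32_src_offsets(src_shape0=2, src_shape1=16, elem_size=4)
```

For these assumed sizes both paths yield the same pairs. The generic path changes with `src_cols`, `src_stride0`, and `elem_size`, whereas the `mfma32` path hard-codes the constants 4 and 16.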

Parameters:

  • mfma32 (Bool): Select the 32×32 MFMA register permutation instead of the src tile's natural layout.

Args: