Mojo struct
RegTileWriter
struct RegTileWriter[dtype: DType, thread_rows: Int, thread_cols: Int]
AMD buffer-resource store for writing register tiles to DRAM.
Pre-builds the AMDBufferResource from the full DRAM output tile once.
Each store() call writes a warp sub-tile's worth of register data
to DRAM via the pre-built descriptor; OOB lanes (past the recorded
byte bound) are silently dropped by the hardware clamp.
Pure TileTensor implementation: uses TileTensor distribute_with_offset directly (no LayoutTensor conversion). The distribute operation divides shape by thread_shape and multiplies strides by thread_shape, producing offsets identical to LayoutTensor's zipped_divide for flat 2D layouts.
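The distribute arithmetic described above can be sketched as a plain-Python model; this is not the Mojo API. The function names and the thread-coordinate mapping (t % thread_rows, t // thread_rows), chosen here to match a col-major thread layout, are assumptions for illustration. The real distribute_with_offset may differ in details.

```python
def distribute(shape, strides, thread_shape, thread_id):
    """Model of distribute for a flat 2D layout (hypothetical helper).

    Fragment shape  = shape / thread_shape
    Fragment stride = strides * thread_shape
    """
    rows, cols = shape
    t_rows, t_cols = thread_shape
    assert rows % t_rows == 0 and cols % t_cols == 0
    frag_shape = (rows // t_rows, cols // t_cols)
    frag_strides = (strides[0] * t_rows, strides[1] * t_cols)
    # Assumed col-major thread layout: thread t sits at (t % t_rows, t // t_rows).
    t_row, t_col = thread_id % t_rows, thread_id // t_rows
    base = t_row * strides[0] + t_col * strides[1]
    return frag_shape, frag_strides, base


def fragment_offsets(shape, strides, thread_shape, thread_id):
    """Linear offsets owned by one thread, in fragment row-major order."""
    (f_rows, f_cols), (s_row, s_col), base = distribute(
        shape, strides, thread_shape, thread_id
    )
    return [base + i * s_row + j * s_col
            for i in range(f_rows) for j in range(f_cols)]


# A 4x8 row-major tile split across a 2x4 thread grid (8 threads).
all_offsets = sorted(
    off
    for tid in range(8)
    for off in fragment_offsets((4, 8), (8, 1), (2, 4), tid)
)
```

In this model every element of the 4x8 tile is owned by exactly one thread, and each thread's elements are interleaved with stride thread_shape rather than contiguous.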
A single store[mfma32: Bool = False] method handles both:
- mfma32=False: Generic path using the src tile's own layout indexing (any MMA shape).
- mfma32=True: 32×32 MFMA path with hardware-specific register permutation (src[4*n + 16*m] → fragment position 4*m + n).
See RegTileWriterLDS.copy for the matching row-major register reader
used in DRAM→reg→SMEM pipelines.
The buffer-resource OOB clamp bounds the store by the destination
tensor's TOTAL byte extent, not by a per-row column extent, so a
SIMD chunk that straddles an N boundary (last column block when
N % BN != 0) will spill into the next row of the same buffer
instead of being clipped. Use RegTileEpilogue instead for kernels
that need to support N-misaligned shapes (or a fused lambda).
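The spill behaviour described above can be illustrated with a small Python model of a store that is clamped only by the buffer's total size. This is a simplified sketch: the store_chunk helper is hypothetical, the clamp is modelled per element rather than per hardware access, and sizes are in elements instead of bytes.

```python
def store_chunk(buf, n_cols, row, col, values):
    """Write a SIMD chunk at (row, col) of a row-major buffer.

    Only offsets past the END of the whole buffer are dropped; there is
    no per-row column check, which mirrors the clamp described above.
    Returns the linear offsets that were actually written.
    """
    written = []
    for k, value in enumerate(values):
        off = row * n_cols + col + k
        if off < len(buf):  # total-extent clamp only
            buf[off] = value
            written.append(off)
    return written


M, N = 2, 6            # N = 6 is not a multiple of the chunk width 4
buf = [0] * (M * N)

# Last column block of row 0: two lanes land in row 1, columns 0 and 1.
spill = store_chunk(buf, N, row=0, col=4, values=[1, 2, 3, 4])

# Last column block of the last row: the two excess lanes are dropped.
tail = store_chunk(buf, N, row=1, col=4, values=[5, 6, 7, 8])
```

Only the final row is protected by the clamp; every earlier row overwrites the start of the row that follows it, which is why an N-misaligned shape needs a per-row bound.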
Parameters
- dtype (DType): Element data type for DRAM destination.
- thread_rows (Int): Number of rows in the col-major thread distribution.
- thread_cols (Int): Number of columns in the col-major thread distribution.
Fields
- bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM stores.
- base_ptr_as_int (Int): Integer address of the full DRAM tile base pointer.
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TrivialRegisterPassable
comptime members
thread_layout
comptime thread_layout = col_major[thread_rows, thread_cols]()
Methods
__init__
__init__(dst_base: TileTensor[address_space=dst_base.address_space, linear_idx_type=dst_base.linear_idx_type, element_size=dst_base.element_size]) -> Self
Create a writer from the full DRAM output tile.
The TileTensor must carry Scalar for any masked dimension (e.g. valid_rows) so that make_amd_buffer_resource computes correct OOB clamping bounds.
Args:
- dst_base (TileTensor[address_space=dst_base.address_space, linear_idx_type=dst_base.linear_idx_type, element_size=dst_base.element_size]): The full DRAM output tile as TileTensor.
store
store[mfma32: Bool = False](self, dst_warp_tile: TileTensor[dtype, address_space=dst_warp_tile.address_space, linear_idx_type=dst_warp_tile.linear_idx_type, element_size=dst_warp_tile.element_size], src_tile: TileTensor[address_space=AddressSpace.LOCAL, linear_idx_type=src_tile.linear_idx_type, element_size=src_tile.element_size])
Write register tile data to a DRAM warp sub-tile.
The distribute + base-offset prologue is identical across MMA
shapes; only the (iteration index i, src scalar offset) pair
differs:
- mfma32=False: iterate i in range(dst_shape0 * dst_shape1), read src at (i // src_cols) * src_stride0 + (i % src_cols) * elem_size (source's natural row-major layout).
- mfma32=True: iterate (m, n) over (src_shape0, src_shape1 / elem_size), read src at 4*n + 16*m (32×32 MFMA register permutation) and map to fragment position i = 4*m + n.
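The two (iteration index, src scalar offset) schemes can be written out as a short Python sketch. It only restates the index formulas given above; the function names are hypothetical, the division src_shape1 / elem_size is taken to be integer division, and the example shapes are made up for illustration rather than taken from a real kernel.

```python
def generic_pairs(dst_shape0, dst_shape1, src_cols, src_stride0, elem_size):
    """(fragment index i, src scalar offset) pairs for mfma32=False."""
    return [
        (i, (i // src_cols) * src_stride0 + (i % src_cols) * elem_size)
        for i in range(dst_shape0 * dst_shape1)
    ]


def mfma32_pairs(src_shape0, src_shape1, elem_size):
    """(fragment index i, src scalar offset) pairs for mfma32=True."""
    return [
        (4 * m + n, 4 * n + 16 * m)
        for m in range(src_shape0)
        for n in range(src_shape1 // elem_size)  # assumed integer division
    ]


# Illustrative shapes only.
generic = generic_pairs(dst_shape0=2, dst_shape1=2,
                        src_cols=2, src_stride0=8, elem_size=4)
permuted = mfma32_pairs(src_shape0=2, src_shape1=16, elem_size=4)
```

Both paths produce the same kind of pair list, which is consistent with the note that only this pairing differs between MMA shapes while the distribute and base-offset prologue is shared.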
Parameters:
- mfma32 (Bool): Select the 32×32 MFMA register permutation instead of the src tile's natural layout.
Args:
- dst_warp_tile (TileTensor[dtype, address_space=dst_warp_tile.address_space, linear_idx_type=dst_warp_tile.linear_idx_type, element_size=dst_warp_tile.element_size]): Vectorized DRAM warp sub-tile.
- src_tile (TileTensor[address_space=AddressSpace.LOCAL, linear_idx_type=src_tile.linear_idx_type, element_size=src_tile.element_size]): Register TileTensor with MMA output data.