
Mojo struct

RegTileWriterLDS

struct RegTileWriterLDS[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], swizzle: Optional[Swizzle] = None, num_threads: Int = thread_layout.size()]

Stateless register→LDS copy expert.

Sibling to RegTileLoader / RegTileWriter (DRAM↔reg) and TileLoaderLDS / SubTileLoaderLDS (DRAM→LDS). Writes register tiles to shared memory via thread-distributed element stores.

Static methods:

copy: Standard plain-SMEM write (rank-2 or rank-3 distributed layouts); reads src in row-major order to match RegTileLoader's storage.
copy_blocked: blocked_product SMEM write with its own block_cols parameter. Used when thread_layout and the SMEM layout have mismatched blocked structure that distribute_with_offset can't resolve.

Parameters

thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Thread distribution layout across the tile.
swizzle (Optional[Swizzle]): Optional SMEM swizzle for bank-conflict avoidance.
num_threads (Int): Number of participating threads. Threads whose index is at or past thread_layout.size() exit early.
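
To illustrate why a swizzle helps with bank conflicts, here is a minimal Python model (not Mojo, and not this struct's implementation). It assumes an XOR-based swizzle parameterized by `bits`, `base`, and `shift` in the style of CUTLASS CuTe, and a toy memory with 8 banks selected by `offset % 8`. Both the parameter names and the bank model are assumptions made for illustration.

```python
def swizzle(offset, bits=3, base=0, shift=3):
    """XOR-swizzle an element offset (assumed CuTe-style parameterization)."""
    # Select `bits` bits starting at position base + shift ...
    mask = ((1 << bits) - 1) << (base + shift)
    # ... and XOR them into the bits starting at position `base`.
    return offset ^ ((offset & mask) >> shift)


# One column of an 8x8 row-major tile: offsets 0, 8, 16, ..., 56.
column = [row * 8 for row in range(8)]

# Toy bank model: bank index = offset % 8 (assumption, not real hardware).
plain_banks = [offset % 8 for offset in column]
swizzled_banks = [swizzle(offset) % 8 for offset in column]
```

In this model every element of the unswizzled column falls into bank 0, while the swizzled offsets spread across all 8 banks. The mapping is still a permutation of the 64 offsets, so no two elements collide.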

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`copy`

static def copy(dst: TileTensor[Storage=dst.Storage, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type], src: TileTensor[Storage=src.Storage, address_space=AddressSpace.LOCAL, linear_idx_type=src.linear_idx_type])

Copy register data to SMEM, distributed across threads.

Reads src registers in row-major element order to match the storage convention of RegTileLoader. Supports both flat (rank 2) and hierarchical (rank 3) distributed layouts.

Args:

dst (TileTensor[Storage=dst.Storage, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type]): Destination TileTensor in shared memory.
src (TileTensor[Storage=src.Storage, address_space=AddressSpace.LOCAL, linear_idx_type=src.linear_idx_type]): Source TileTensor in local (register) memory.
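
The thread-distributed store described above can be modeled in plain Python (not Mojo). This is a sketch under stated assumptions, not the library's implementation: a flat rank-2 case, a row-major thread layout, an interleaved distribution in which each thread owns every `thread_rows`-th row and `thread_cols`-th column, and a row-major destination. The function name `distributed_copy` and its arguments are hypothetical.

```python
def distributed_copy(tile_rows, tile_cols, thread_rows, thread_cols,
                     num_threads, regs_for):
    """Model each thread storing its register fragment into a shared tile."""
    smem = [None] * (tile_rows * tile_cols)
    frag_rows = tile_rows // thread_rows  # elements per thread along rows
    frag_cols = tile_cols // thread_cols  # elements per thread along columns
    for tid in range(num_threads):
        if tid >= thread_rows * thread_cols:
            continue  # threads past the thread layout's size exit early
        t_row, t_col = divmod(tid, thread_cols)  # row-major thread coordinates
        regs = regs_for(tid)
        for i in range(frag_rows):
            for j in range(frag_cols):
                # Registers are read in row-major order: index i * frag_cols + j.
                row = i * thread_rows + t_row
                col = j * thread_cols + t_col
                smem[row * tile_cols + col] = regs[i * frag_cols + j]
    return smem


# 4x4 tile, 2x2 thread layout, 6 launched threads (the last two do nothing).
smem = distributed_copy(4, 4, 2, 2, 6,
                        lambda tid: [(tid, k) for k in range(4)])
```

Each entry of `smem` records which thread wrote it and from which register index, which makes the ownership pattern easy to inspect.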

`copy_blocked`

static def copy_blocked[block_cols: Int](dst: TileTensor[address_space=AddressSpace.SHARED], src: TileTensor[dst.dtype, address_space=AddressSpace.LOCAL])

Copy register tile to blocked_product SMEM layout.

Handles structural mismatches between thread_layout and SMEM layout by computing per-element SMEM offsets using the blocked_product formula. Reads registers sequentially as simd_width-wide vectors; this is invariant to col- vs row-major flat ordering when each per-thread row equals one SIMD vector.

The SMEM layout is blocked_product with blocks of dst.shape[0] x block_cols. thread_layout distributes a 2D grid of (data_rows, data_cols/simd_width) vector positions across threads.

Parameters:

block_cols (Int): Cols per SMEM block in blocked_product layout.

Args:

dst (TileTensor[address_space=AddressSpace.SHARED]): Destination tile of shape [block_rows, data_cols] in SHARED.
src (TileTensor[dst.dtype, address_space=AddressSpace.LOCAL]): Source register tile in LOCAL (row-major elements).
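
The per-element offset computation for a blocked layout can be sketched in Python (not Mojo). This model assumes each block of `block_rows` x `block_cols` elements is stored contiguously in row-major order and that blocks follow one another along the column dimension. The formula actually used by copy_blocked may differ. The names `blocked_offset` and `simd_width` as a plain integer are illustrative.

```python
def blocked_offset(row, col, block_rows, block_cols):
    """Linear offset of (row, col) in the assumed blocked layout."""
    block, col_in_block = divmod(col, block_cols)
    # Skip whole blocks, then index row-major inside the block.
    return block * block_rows * block_cols + row * block_cols + col_in_block


block_rows, data_cols, block_cols, simd_width = 2, 8, 4, 2

# Offsets of every element in the [block_rows, data_cols] tile.
offsets = [[blocked_offset(r, c, block_rows, block_cols)
            for c in range(data_cols)] for r in range(block_rows)]

# Grid of (data_rows, data_cols / simd_width) vector positions: each entry
# is the offset of the first element of one simd_width-wide vector.
vector_offsets = {
    (r, v): blocked_offset(r, v * simd_width, block_rows, block_cols)
    for r in range(block_rows)
    for v in range(data_cols // simd_width)
}
```

Because `simd_width` divides `block_cols` in this example, a vector never straddles two blocks, so each vector lands in `simd_width` consecutive offsets.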
