Mojo struct

RegTileWriterLDS

struct RegTileWriterLDS[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], swizzle: Optional[Swizzle] = None, num_threads: Int = thread_layout.size()]

Stateless register→LDS copy expert. LDS (Local Data Share) is the GPU's shared memory, referred to as SMEM below.

Sibling to RegTileLoader / RegTileWriter (DRAM↔reg) and TileLoaderLDS / SubTileLoaderLDS (DRAM→LDS). Writes register tiles to shared memory via thread-distributed element stores.

Static methods:

  • copy: Standard plain-SMEM write (rank-2 or rank-3 distributed layouts). Reads src in row-major order to match RegTileLoader's storage.
  • copy_blocked: blocked_product SMEM write with its own block_cols parameter. Used when thread_layout and the SMEM layout have mismatched blocked structure that distribute_with_offset can't resolve.

Parameters

Implemented traits

AnyType, ImplicitlyDestructible

Methods

copy

static copy(dst: TileTensor[address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[address_space=AddressSpace.LOCAL, linear_idx_type=src.linear_idx_type, element_size=src.element_size])

Copy register data to SMEM, distributed across threads.

Reads src registers in row-major element order to match the storage convention of RegTileLoader. Supports both flat (rank 2) and hierarchical (rank 3) distributed layouts.
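The thread-distributed store described above can be modelled in plain Python. This is only a sketch, not the Mojo API: it assumes a flat (rank 2) row-major thread layout that tiles the destination so each thread owns the elements strided by the layout shape. The shapes and the names `thread_coords` and `copy_model` are hypothetical, chosen for illustration.

```python
# Illustrative model only: plain Python, not the Mojo API.
# Assumption: a row-major thread layout of shape (T_ROWS, T_COLS) tiles the
# destination, so each thread owns the elements strided by the layout shape.

T_ROWS, T_COLS = 2, 4   # hypothetical thread_layout shape (8 threads)
ROWS, COLS = 4, 8       # hypothetical SMEM tile shape
FRAG_ROWS, FRAG_COLS = ROWS // T_ROWS, COLS // T_COLS  # per-thread fragment


def thread_coords(tid):
    """Row-major position of thread `tid` inside the thread layout."""
    return tid // T_COLS, tid % T_COLS


def copy_model(regs_per_thread):
    """Scatter every thread's registers into a row-major SMEM tile."""
    smem = [None] * (ROWS * COLS)
    for tid, regs in enumerate(regs_per_thread):
        t_row, t_col = thread_coords(tid)
        for i in range(FRAG_ROWS):
            for j in range(FRAG_COLS):
                reg = regs[i * FRAG_COLS + j]  # row-major register order
                row = i * T_ROWS + t_row       # rows owned by this thread
                col = j * T_COLS + t_col       # columns owned by this thread
                smem[row * COLS + col] = reg
    return smem


# Each register holds (thread id, register index) so ownership is visible.
regs = [[(tid, k) for k in range(FRAG_ROWS * FRAG_COLS)]
        for tid in range(T_ROWS * T_COLS)]
smem = copy_model(regs)
```

In this model every SMEM element is written exactly once, and the register index advances fastest along the column direction of the per-thread fragment, which is what "row-major element order" means here.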

Args:

copy_blocked

static copy_blocked[block_cols: Int](dst: TileTensor[address_space=AddressSpace.SHARED], src: TileTensor[dst.dtype, address_space=AddressSpace.LOCAL])

Copy register tile to blocked_product SMEM layout.

Handles structural mismatches between thread_layout and SMEM layout by computing per-element SMEM offsets using the blocked_product formula. Reads registers sequentially as simd_width-wide vectors; this is invariant to col- vs row-major flat ordering when each per-thread row equals one SIMD vector.
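The ordering invariance mentioned above can be checked with a small plain-Python model. It is a sketch under assumptions, not the library's code: a per-thread fragment is treated as a grid of SIMD vectors, and `flat_index`, `vector_order`, `SIMD_WIDTH`, and `FRAG_ROWS` are hypothetical names and values.

```python
# Illustrative model only: plain Python, not the Mojo API.

SIMD_WIDTH = 4   # hypothetical vector width in elements
FRAG_ROWS = 3    # hypothetical number of rows in the per-thread fragment


def flat_index(row, col, rows, cols, order):
    """Flat index of (row, col) in a rows x cols grid for a given ordering."""
    if order == "row":
        return row * cols + col
    if order == "col":
        return col * rows + row
    raise ValueError(f"unknown order: {order}")


def vector_order(frag_cols, order):
    """Sequence in which vector positions are visited for a flat ordering."""
    vec_cols = frag_cols // SIMD_WIDTH  # vectors per fragment row
    positions = [(r, v) for r in range(FRAG_ROWS) for v in range(vec_cols)]
    return sorted(
        positions,
        key=lambda p: flat_index(p[0], p[1], FRAG_ROWS, vec_cols, order),
    )


# One vector per row: both orderings visit the vectors in the same sequence.
one_vector_rows = vector_order(SIMD_WIDTH, "row") == vector_order(SIMD_WIDTH, "col")

# Two vectors per row: the orderings diverge, so the invariance no longer holds.
two_vector_rows = vector_order(2 * SIMD_WIDTH, "row") == vector_order(2 * SIMD_WIDTH, "col")
```

With one vector per row the vector grid has a single column, so the row-major and column-major flat indices both reduce to the row number.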

The SMEM layout is blocked_product with blocks of dst.shape[0] x block_cols. thread_layout distributes a 2D grid of (data_rows, data_cols/simd_width) vector positions across threads.
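To make the blocked addressing concrete, here is a plain-Python sketch of a per-element offset for blocks of rows x block_cols. It assumes that elements are row-major inside each block and that blocks are stored one after another along the column direction; the formula actually used by copy_blocked may differ. `blocked_offset` and the shapes are hypothetical.

```python
# Illustrative model only: plain Python, not the Mojo API.
# Assumptions: row-major inside each block, blocks stored consecutively
# along the column direction.


def blocked_offset(row, col, rows, block_cols):
    """Linear SMEM offset of element (row, col) under the assumed blocking."""
    block, col_in_block = divmod(col, block_cols)
    return block * rows * block_cols + row * block_cols + col_in_block


ROWS, COLS, BLOCK_COLS = 4, 8, 4  # hypothetical tile: two 4 x 4 blocks
offsets = [
    [blocked_offset(r, c, ROWS, BLOCK_COLS) for c in range(COLS)]
    for r in range(ROWS)
]
```

Under these assumptions, column 4 of row 0 lands at offset 16 (the start of the second block) rather than at offset 4 as it would in a flat row-major tile.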

Parameters:

  • block_cols (Int): Cols per SMEM block in blocked_product layout.

Args: