
Mojo struct

SubTileLoaderLDS

struct SubTileLoaderLDS[dtype: DType, swizzle: Optional[Swizzle] = Optional(), swizzle2: Optional[Swizzle] = Optional()]

DRAM→LDS DMA expert for single-sub-tile TileTensor-indexed loads.

Sibling of TileLoaderLDS (the warp-group cooperative, coord-indexed loader). This loader issues one buffer_load_*_lds burst per .load() call for a single source sub-tile. It matches attention's KV-cache warp DMA pattern, in which each warp claims a (warp_tile_rows, BK) slice of a (BN, K-span) DRAM tile and streams it into its SMEM lane.

The AMD buffer_load_*_lds intrinsic is emitted with the amdgpu.AsyncCopies alias scope via rocdl.raw.ptr.buffer.load.lds, so consumer-side LDS reads tagged with noalias_scopes=_alias_scope_attr (see ds_read_tr* at lines 96, 419-480) can skip s_waitcnt vmcnt(0). This is the SIInsertWaitcnts vmcnt-relaxation handshake from LLVM PR #74537. It is safe because attention kernels also maintain an explicit s_waitcnt vmcnt(0) + s_barrier fence at DMA/compute boundaries.

Why not stdlib load_to_lds[async_copies=True]: stdlib's async_copies=True attaches its own alias_scope MLIR attribute, which is textually identical to _alias_scope_attr but is a distinct MLIR object. ScopedNoAliasAA matches by identity, so the DMA and LDS-consumer scopes don't match and the relaxation is silently disabled (MLA regresses 0.76 abs at output[0,0,0,0], the same signature as b7b68a00290). The intrinsic emission is therefore kept local to this file so that producer and consumer share the exact same _alias_scope_attr object. If stdlib ever exports the scope as a shareable symbol, collapse this body to bc.load_to_lds[async_copies=True].
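The identity-versus-text distinction described above can be illustrated with a small model. This is a sketch in plain Python, not Mojo or MLIR: the `AliasScope` class and `scopes_match` function are hypothetical stand-ins invented for illustration, and the only behaviour taken from the text is that two scopes match when they are the same object, not when they merely print the same.

```python
class AliasScope:
    """Hypothetical stand-in for an MLIR alias-scope attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


def scopes_match(producer: AliasScope, consumer: AliasScope) -> bool:
    # Model of identity-based matching: equal text is not sufficient.
    return producer is consumer


# One scope object shared by the local DMA emission and the LDS reads.
local_scope = AliasScope("amdgpu.AsyncCopies")
# A separately constructed scope with the same text (the stdlib case).
stdlib_scope = AliasScope("amdgpu.AsyncCopies")

same_text = local_scope.text == stdlib_scope.text      # textually identical
shared_ok = scopes_match(local_scope, local_scope)     # producer and consumer share the object
mixed_ok = scopes_match(stdlib_scope, local_scope)     # distinct objects, no match

print(same_text, shared_ok, mixed_ok)
```

In this model the mixed case fails to match even though the two scopes are textually indistinguishable, which is why the failure is silent rather than a compile error.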

Constructs the AMDBufferResource once from a DRAM tile (which may carry Scalar valid_rows for bounds clamping via MixedLayout). Each load() call reuses the descriptor: one shared bc per tile, with zero per-warp overhead for buffer resource construction. SRD bounds are computed by make_amd_buffer_resource via _get_bounds; hardware clamps OOB reads to zero.
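The clamping behaviour can be pictured with a toy model. The sketch below is plain Python, not the Mojo API: `BufferResourceModel` and `make_resource` are hypothetical names, and the bound is counted in elements for simplicity (an assumption; the real descriptor's units are not stated here). It only illustrates the two properties named above: the descriptor is built once from the valid row count, and reads past the bound return zero.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BufferResourceModel:
    """Toy descriptor: backing data plus a bound on valid elements."""

    data: Tuple[int, ...]
    num_valid: int  # elements inside the bound (simplified unit)

    def load(self, offset: int) -> int:
        # Out-of-bounds reads are clamped to zero instead of faulting.
        if 0 <= offset < self.num_valid:
            return self.data[offset]
        return 0


def make_resource(data: Sequence[int], valid_rows: int, row_stride: int) -> BufferResourceModel:
    # Built once per tile; every later load reuses the same object.
    return BufferResourceModel(tuple(data), valid_rows * row_stride)


# A 4x4 row-major tile holding 1..16, of which only the first 3 rows are valid.
ROW_STRIDE = 4
bc = make_resource(range(1, 17), valid_rows=3, row_stride=ROW_STRIDE)

in_bounds = bc.load(2 * ROW_STRIDE + 1)   # row 2, col 1
clamped = bc.load(3 * ROW_STRIDE + 0)     # row 3 is past valid_rows
print(in_bounds, clamped)
```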

Parameters

  • dtype (DType): Element data type.
  • swizzle (Optional[Swizzle]): Optional swizzle for bank conflict reduction.
  • swizzle2 (Optional[Swizzle]): Optional second swizzle, applied AFTER swizzle. Use to compose two-XOR swizzles (e.g., HK's st_32x32 bit5^=bit9 + bit4^=bit10 byte-level pair, which are not expressible as a single Swizzle).
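To make the composition order concrete, the sketch below models the two single-bit XORs from the swizzle2 example on a byte offset. It is plain Python rather than Mojo, and `xor_bit`, `swizzle_a`, `swizzle_b`, and `composed` are hypothetical helper names, not part of the Swizzle API. It only shows what "applied AFTER swizzle" means for the bit5^=bit9 and bit4^=bit10 pair.

```python
def xor_bit(offset: int, dst_bit: int, src_bit: int) -> int:
    """Return offset with bit dst_bit XORed by the value of bit src_bit."""
    return offset ^ (((offset >> src_bit) & 1) << dst_bit)


def swizzle_a(offset: int) -> int:
    # First swizzle: bit5 ^= bit9.
    return xor_bit(offset, 5, 9)


def swizzle_b(offset: int) -> int:
    # Second swizzle: bit4 ^= bit10.
    return xor_bit(offset, 4, 10)


def composed(offset: int) -> int:
    # swizzle2 is applied after swizzle.
    return swizzle_b(swizzle_a(offset))


only_bit9 = composed(1 << 9)     # bit9 set, so bit5 flips
only_bit10 = composed(1 << 10)   # bit10 set, so bit4 flips
is_permutation = sorted(composed(i) for i in range(1 << 11)) == list(range(1 << 11))
print(only_bit9, only_bit10, is_permutation)
```

Because the source bits (9 and 10) are never modified, the composed mapping is its own inverse and permutes the 11-bit offset range without collisions.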

Fields

  • bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM access.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

Methods

__init__

__init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]) -> Self

Create a loader from a DRAM tile.

The tile's layout carries the valid row count (via Scalar dim[0] in MixedLayout). make_amd_buffer_resource reads that dimension to compute the SRD size.

Args:

load

load(self, dst: TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size])

Load a warp sub-tile from DRAM to LDS.

The src tile should be a warp-sized sub-tile of the original DRAM tile. Offsets are computed relative to the bc's base pointer, so the src pointer must be within the original tile's address range.
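The relative-offset requirement can be sketched as simple address arithmetic. The following is a plain-Python model, not the loader's implementation: `sub_tile_offset` and all the numeric values (base address, row stride, element size, warp tile height) are hypothetical and chosen only for illustration.

```python
def sub_tile_offset(base_addr: int, src_addr: int, elem_bytes: int, tile_elems: int) -> int:
    """Element offset of a sub-tile relative to the descriptor's base pointer."""
    delta = src_addr - base_addr
    if delta < 0 or delta % elem_bytes != 0:
        raise ValueError("src must start at an element boundary at or after the base")
    offset = delta // elem_bytes
    if offset >= tile_elems:
        raise ValueError("src lies outside the original tile's address range")
    return offset


# Hypothetical tile: 64 rows x 64 elements, 2-byte elements, 16-row warp slices.
BASE_ADDR = 0x1000
ROW_STRIDE = 64
ELEM_BYTES = 2
WARP_TILE_ROWS = 16
TILE_ELEMS = 64 * ROW_STRIDE

warp_id = 2
src_addr = BASE_ADDR + warp_id * WARP_TILE_ROWS * ROW_STRIDE * ELEM_BYTES
offset = sub_tile_offset(BASE_ADDR, src_addr, ELEM_BYTES, TILE_ELEMS)
print(offset)
```

In this model, a src pointer taken from a different allocation would yield a negative or out-of-range delta, which is the situation the requirement above rules out.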

Args: