
Mojo struct

SubTileLoaderLDS

struct SubTileLoaderLDS[dtype: DType, swizzle: Optional[Swizzle] = Optional(), swizzle2: Optional[Swizzle] = Optional()]

DRAM→LDS DMA expert for single-sub-tile TileTensor-indexed loads.

Sibling of TileLoaderLDS (the warp-group cooperative, coord-indexed loader). This loader issues one buffer_load_*_lds burst per .load() call for a single source sub-tile. It targets attention's KV-cache warp DMA pattern: each warp claims a (warp_tile_rows, BK) slice of a (BN, K-span) DRAM tile and streams it into its SMEM lane.

The AMD buffer_load_*_lds intrinsic is emitted with the amdgpu.AsyncCopies alias scope via rocdl.raw.ptr.buffer.load.lds, so consumer-side LDS reads tagged with noalias_scopes=_alias_scope_attr (see ds_read_tr* at lines 96, 419-480) can skip s_waitcnt vmcnt(0). This is the SIInsertWaitcnts vmcnt-relaxation handshake from LLVM PR #74537. The relaxation is safe because attention kernels also maintain an explicit s_waitcnt vmcnt(0) + s_barrier fence at DMA/compute boundaries.

Why not stdlib load_to_lds[async_copies=True]: stdlib's async_copies=True attaches its own alias_scope MLIR attribute, which is textually identical to _alias_scope_attr but is a distinct MLIR object. ScopedNoAliasAA matches by identity, so the DMA and LDS-consumer scopes don't match and the relaxation is silently disabled (MLA regresses 0.76 abs at output[0,0,0,0], the same signature as b7b68a00290). The intrinsic emission is therefore kept local to this file so that producer and consumer share the exact same _alias_scope_attr object. If stdlib ever exports the scope as a shareable symbol, collapse this body to bc.load_to_lds[async_copies=True].
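The identity-versus-text distinction in the paragraph above can be illustrated with a small Python model. This is only an analogy, not MLIR or LLVM code: the AliasScope class and scopes_match_by_identity function are hypothetical names, and `is` stands in for the identity-based matching the text attributes to ScopedNoAliasAA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasScope:
    """Hypothetical stand-in for an alias-scope attribute (not an MLIR type)."""
    name: str


def scopes_match_by_identity(producer: AliasScope, consumer: AliasScope) -> bool:
    # Models identity-based matching: only the very same object counts.
    return producer is consumer


# Scope owned by this file, shared by the DMA producer and the LDS consumer.
local_scope = AliasScope("amdgpu.AsyncCopies")
# A second scope built elsewhere with the same text (the stdlib case).
stdlib_scope = AliasScope("amdgpu.AsyncCopies")

textually_equal = local_scope == stdlib_scope                        # same contents
mismatched = scopes_match_by_identity(stdlib_scope, local_scope)    # distinct objects
shared = scopes_match_by_identity(local_scope, local_scope)         # one shared object

print(textually_equal, mismatched, shared)
```

In this model the two scopes compare equal by value yet fail the identity check, which mirrors why the relaxation is silently lost unless both sides reference one shared object.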

Constructs the AMDBufferResource once from a DRAM tile (which may carry a Scalar valid_rows for bounds clamping via MixedLayout). Each load() call reuses the descriptor: one shared bc per tile, with zero per-warp overhead for buffer resource construction. SRD bounds are computed by make_amd_buffer_resource via _get_bounds; hardware clamps OOB reads to zero.
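A minimal Python sketch of the "build once, clamp out-of-bounds reads to zero" behavior described above. BufferResourceModel and make_resource are hypothetical names, and the bound formula (valid rows times row stride in bytes) is an assumption made for illustration; the real bounds come from _get_bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferResourceModel:
    """Hypothetical model of a buffer descriptor: backing bytes plus a byte bound."""
    memory: bytes
    num_bytes: int  # bound derived from the valid rows, not from the full tile

    def load_byte(self, offset: int) -> int:
        # Models the hardware clamp: reads outside the bound return zero.
        if 0 <= offset < self.num_bytes:
            return self.memory[offset]
        return 0


def make_resource(memory: bytes, valid_rows: int, row_stride_bytes: int) -> BufferResourceModel:
    # Assumption for this sketch: bound = valid_rows * row stride in bytes.
    return BufferResourceModel(memory, valid_rows * row_stride_bytes)


backing = bytes(range(1, 33))  # 4 rows x 8 bytes, holding the values 1..32
bc = make_resource(backing, valid_rows=3, row_stride_bytes=8)  # built once per tile

# Every later load reuses the same descriptor.
in_bounds = [bc.load_byte(o) for o in (0, 23)]       # last valid byte is offset 23
out_of_bounds = [bc.load_byte(o) for o in (24, 31)]  # row 3 is past valid_rows

print(in_bounds, out_of_bounds)
```

The fourth row exists in the backing memory, but because the descriptor's bound covers only three valid rows, reads from it come back as zero.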

Parameters​

  • ​dtype (DType): Element data type.
  • ​swizzle (Optional[Swizzle]): Optional swizzle for bank conflict reduction.
  • ​swizzle2 (Optional[Swizzle]): Optional second swizzle, applied AFTER swizzle. Use to compose two-XOR swizzles (e.g., the reference st_32x32 bit5^=bit9 + bit4^=bit10 byte-level pair, which are not expressible as a single Swizzle).
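To see why the bit5^=bit9 + bit4^=bit10 pair needs two swizzles, here is a small Python model. It assumes a CuTe-style Swizzle(bits, base, shift) that XORs bits [base+shift, base+shift+bits) into bits [base, base+bits); that reading is inferred from this page's remark that Swizzle(1,0,6) flips bit 0 based on bit 6, and it is not taken from the Mojo implementation. The swizzle and two_xor functions are hypothetical helpers.

```python
def swizzle(offset: int, bits: int, base: int, shift: int) -> int:
    """XOR bits [base+shift, base+shift+bits) of offset into bits [base, base+bits)."""
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


def two_xor(offset: int) -> int:
    # First swizzle: bit5 ^= bit9, i.e. bits=1, base=5, shift=4.
    first = swizzle(offset, bits=1, base=5, shift=4)
    # Second swizzle, applied after the first: bit4 ^= bit10, i.e. bits=1, base=4, shift=6.
    return swizzle(first, bits=1, base=4, shift=6)


only_bit9 = two_xor(1 << 9)                 # sets bit 5
only_bit10 = two_xor(1 << 10)               # sets bit 4
both_bits = two_xor((1 << 9) | (1 << 10))   # sets bits 5 and 4

print(only_bit9, only_bit10, both_bits)
```

Under this convention the two XORs use different shifts (4 and 6), so a single (bits, base, shift) triple cannot describe both, which is the case swizzle2 is meant to cover.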

Fields​

  • ​bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM access.

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods​

__init__​

def __init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]) -> Self

Create a loader from a DRAM tile.

The tile's layout carries the valid row count (via Scalar dim[0] in MixedLayout). make_amd_buffer_resource reads that dimension to compute the SRD size.

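Args:

  • gmem_tile (TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]): DRAM tile from which the buffer resource is built.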

load​

def load[hoist_scalar_offset: Bool = False](self, dst: TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size], scalar_offset: Int = 0, worker_base: Int = 0)

Load a warp sub-tile from DRAM to LDS.

The src tile should be a warp-sized sub-tile of the original DRAM tile. Offsets are computed relative to the bc's base pointer, so the src pointer must be within the original tile's address range.

Comptime hoist_scalar_offset selects which scalar-offset codegen path the inner loop takes:

  • False (default, legacy codegen): scalar_offset is ignored. Each iteration recomputes Int(src_partitions.ptr) - dram_base, where src_partitions is the per-iteration sub-tile of src. This matches the pre-refactor inline DMA emission: an s_add of the per-iteration pointer base plus a bc-base subtract. This is what MhaPrefillV2, KVBuffer, and _MlaKDmaPair at KV<128 want: the legacy SGPR pressure profile that benchmarks verified at KV=64 (no -17% regression).
  • True (opt-in hoist): the caller's scalar_offset is used directly, plus the comptime partition_offset_bytes. The runtime part is computed once at the call site and shared across the inner-loop iterations. This is what _MlaKDmaPair at KV>=128 wants: one SGPR carries the hoisted base across both dma_nope and dma_rope, giving the +7% lift at KV=128.
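The two paths above should yield the same byte offsets when the caller passes a consistent scalar_offset; they differ only in where the subtraction happens. The Python sketch below models that arithmetic. The partition offsets, addresses, and function names are hypothetical values chosen for illustration, not taken from the kernel.

```python
from typing import List

# Hypothetical compile-time byte offsets of each inner-loop partition within src.
PARTITION_OFFSET_BYTES = (0, 512, 1024, 1536)


def legacy_offsets(src_ptr: int, dram_base: int) -> List[int]:
    # hoist_scalar_offset=False: every iteration subtracts the base from its own pointer.
    offsets = []
    for part in PARTITION_OFFSET_BYTES:
        partition_ptr = src_ptr + part
        offsets.append(partition_ptr - dram_base)
    return offsets


def hoisted_offsets(scalar_offset: int) -> List[int]:
    # hoist_scalar_offset=True: one caller-supplied base plus a compile-time constant.
    return [scalar_offset + part for part in PARTITION_OFFSET_BYTES]


dram_base = 0x1000                    # base address held by the buffer resource
src_ptr = dram_base + 4096            # start of this warp's sub-tile
scalar_offset = src_ptr - dram_base   # computed once at the call site

legacy = legacy_offsets(src_ptr, dram_base)
hoisted = hoisted_offsets(scalar_offset)

print(legacy, hoisted)
```

In the hoisted form the only runtime value is scalar_offset, which is why a single register can carry it across several loads that share the same base.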

Parameters:

  • ​hoist_scalar_offset (Bool): Comptime flag selecting between legacy per-iter codegen (False, default) and the explicit caller-supplied hoisted base (True).

Args:

  • ​dst (TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size]): Destination TileTensor in shared memory.
  • ​src (TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size]): Source TileTensor in global memory (warp sub-tile).
  • scalar_offset (Int): Wave-uniform byte offset of src relative to the buffer-resource base. Only consumed when hoist_scalar_offset is True; when it is False, pass 0 (any value works, since the argument is dead-code-eliminated).
  • worker_base (Int): Sub-tile row-strip index for cooperative half-sub-block loads (N-warps-per-subblock partition at depths < 128). When a caller splits a BM-row sub-block across N warps and passes each warp its own M = BM/N-row strip, the loader's internal m_sub_tile collapses to {0} and the swizzle would be computed as if the strip were the first sub-row. That drops the m_sub_tile * WARP_SIZE worker offset that the two-XOR st_32x32_s swizzle needs (the Swizzle(1,0,6) bit-0 flip keys off worker bit 6). Pass the strip's absolute sub-row index here so the swizzle matches the consumer read (MhaMmaOp.load_K). The default, 0, corresponds to a full sub-block load (depth 128) and leaves behavior unchanged.
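The effect of omitting worker_base can be sketched in Python. This model assumes WARP_SIZE = 64 and a worker index of worker_base * WARP_SIZE + lane; both are assumptions for illustration, and swizzle_1_0_6 and swizzled_lane are hypothetical helpers rather than functions from the loader.

```python
WARP_SIZE = 64  # assumption: 64-lane wavefront


def swizzle_1_0_6(worker: int) -> int:
    # Swizzle(1,0,6) as described above: bit 0 flips when worker bit 6 is set.
    return worker ^ ((worker >> 6) & 1)


def swizzled_lane(lane: int, worker_base: int) -> int:
    # Assumed worker index: the strip's absolute sub-row index times WARP_SIZE, plus the lane.
    worker = worker_base * WARP_SIZE + lane
    # Keep only the position within the strip.
    return swizzle_1_0_6(worker) % WARP_SIZE


lanes = [0, 1, 2, 3]
# Strip 1 loaded with worker_base left at 0: treated as if it were the first sub-row.
without_base = [swizzled_lane(lane, worker_base=0) for lane in lanes]
# Strip 1 loaded with its absolute sub-row index: bit 6 is set, so bit 0 flips.
with_base = [swizzled_lane(lane, worker_base=1) for lane in lanes]

print(without_base, with_base)
```

In this model the second strip's lanes land in swapped even/odd positions once worker bit 6 is accounted for, so a producer that leaves worker_base at 0 would write a layout that a consumer using the full worker index does not expect.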