
Mojo struct

SubTileLoaderLDS

struct SubTileLoaderLDS[dtype: DType, swizzle: Optional[Swizzle] = Optional(), swizzle2: Optional[Swizzle] = Optional()]

DRAM→LDS DMA expert for single-sub-tile TileTensor-indexed loads.

Sibling of TileLoaderLDS (warp-group cooperative, coord-indexed). This loader issues one buffer_load_*_lds burst per .load() call for a single source sub-tile. It targets attention's KV-cache warp DMA pattern: each warp claims a (warp_tile_rows, BK) slice of a (BN, K-span) DRAM tile and streams it into its SMEM lane.

The AMD buffer_load_*_lds intrinsic is emitted via rocdl.raw.ptr.buffer.load.lds with the amdgpu.AsyncCopies alias scope, so consumer-side LDS reads tagged with noalias_scopes=_alias_scope_attr (see ds_read_tr* at lines 96, 419-480) can skip s_waitcnt vmcnt(0). This is the SIInsertWaitcnts vmcnt-relaxation handshake from LLVM PR #74537. It is safe because attention kernels also maintain an explicit s_waitcnt vmcnt(0) + s_barrier fence at DMA/compute boundaries.

Why not stdlib load_to_lds[async_copies=True]: stdlib's async_copies=True attaches its own alias_scope MLIR attribute, which is textually identical to _alias_scope_attr but a distinct MLIR object. ScopedNoAliasAA matches by identity, so the DMA and LDS-consumer scopes don't match and the relaxation is silently disabled (MLA regresses 0.76 abs at output[0,0,0,0], the same signature as b7b68a00290). The intrinsic emission is therefore kept local to this file so that producer and consumer share the exact same _alias_scope_attr object. If stdlib ever exports the scope as a shareable symbol, collapse this body to bc.load_to_lds[async_copies=True].
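The identity-versus-text distinction described above can be illustrated with a small Python analogy. This is only a sketch of the matching rule, not MLIR or LLVM code: AliasScope and scopes_match are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasScope:
    """Hypothetical stand-in for an alias-scope attribute object."""
    name: str


def scopes_match(producer: AliasScope, consumer: AliasScope) -> bool:
    # Models identity-based matching: the two sides must hold the very
    # same object, not merely objects that print the same.
    return producer is consumer


local_scope = AliasScope("amdgpu.AsyncCopies")
stdlib_scope = AliasScope("amdgpu.AsyncCopies")  # same text, separate object

textually_equal = local_scope == stdlib_scope          # compares field values
shared_ok = scopes_match(local_scope, local_scope)     # producer and consumer share one object
stdlib_ok = scopes_match(local_scope, stdlib_scope)    # look-alike object fails the match

print(textually_equal, shared_ok, stdlib_ok)
```

In this model the look-alike scope compares equal by value yet fails the identity check, which mirrors why the relaxation is reported as silently disabled when the stdlib path attaches its own attribute.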

Constructs the AMDBufferResource once from a DRAM tile (which may carry Scalar valid_rows for bounds clamping via MixedLayout). Each load() call reuses the descriptor: one shared bc per tile, with zero per-warp overhead for buffer resource construction. SRD bounds are computed by make_amd_buffer_resource via _get_bounds; hardware clamps OOB reads to zero.
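As a rough host-side model of the behavior described above (one descriptor built once, out-of-bounds reads returning zero), the following Python sketch may help. BufferResourceModel, load_byte, and load_row are hypothetical names; the real descriptor layout and bounds-check rules are not reproduced here.

```python
class BufferResourceModel:
    """Hypothetical model of a bounds-checked buffer descriptor."""

    def __init__(self, data: bytes, valid_rows: int, row_bytes: int):
        self.data = data
        # Stands in for the SRD size derived from the tile's valid row count.
        self.num_bytes = valid_rows * row_bytes

    def load_byte(self, offset: int) -> int:
        # In-bounds reads return memory; out-of-bounds reads return zero.
        if 0 <= offset < self.num_bytes:
            return self.data[offset]
        return 0


ROW_BYTES = 4
backing = bytes(range(1, 17))  # 4 rows of 4 bytes holding 1..16

# Built once per tile; only 3 of the 4 rows are marked valid.
bc = BufferResourceModel(backing, valid_rows=3, row_bytes=ROW_BYTES)


def load_row(resource: BufferResourceModel, row: int) -> list:
    # Every load reuses the same descriptor object.
    return [resource.load_byte(row * ROW_BYTES + i) for i in range(ROW_BYTES)]


rows = [load_row(bc, r) for r in range(4)]
print(rows)
```

The fourth row lies past the modeled bound, so it reads back as zeros even though the backing storage holds data there.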

Parameters

dtype (DType): Element data type.
swizzle (Optional[Swizzle]): Optional swizzle for bank conflict reduction.
swizzle2 (Optional[Swizzle]): Optional second swizzle, applied AFTER swizzle. Use to compose two-XOR swizzles (e.g., the reference st_32x32 bit5^=bit9 + bit4^=bit10 byte-level pair, which are not expressible as a single Swizzle).
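To make the two-XOR composition concrete, here is a minimal Python sketch on plain byte offsets. It assumes a (bits, base, shift) swizzle that XORs the bits at positions base+shift and up into the bits at base and up; that parameter order is an assumption for illustration, and the swizzle helper is a hypothetical function rather than the library's Swizzle type.

```python
def swizzle(bits: int, base: int, shift: int):
    """Return a function XOR-ing `bits` bits at base+shift into the bits at base."""
    mask = ((1 << bits) - 1) << (base + shift)

    def apply(offset: int) -> int:
        return offset ^ ((offset & mask) >> shift)

    return apply


first = swizzle(1, 5, 4)    # bit5 ^= bit9
second = swizzle(1, 4, 6)   # bit4 ^= bit10


def two_xor(offset: int) -> int:
    # The second swizzle is applied after the first.
    return second(first(offset))


swizzled = [two_xor(o) for o in range(2048)]

print(two_xor(1 << 9), two_xor(1 << 10), two_xor((1 << 9) | (1 << 10)))
```

The two XORs use different shifts (4 and 6), which is why one (bits, base, shift) triple cannot describe both in this model. Neither step modifies its source bit, so the composition remains a permutation of the offsets.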

Fields

bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM access.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods

`__init__`

def __init__(gmem_tile: TileTensor[dtype, Storage=gmem_tile.Storage, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type]) -> Self

Create a loader from a DRAM tile.

The tile's layout carries the valid row count (via Scalar dim[0] in MixedLayout). make_amd_buffer_resource reads that dimension to compute the SRD size.

Args:

gmem_tile (TileTensor[dtype, Storage=gmem_tile.Storage, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type]): The full DRAM tile from KVCacheIterator.

`load`

def load[hoist_scalar_offset: Bool = False](self, dst: TileTensor[dtype, Storage=dst.Storage, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type], src: TileTensor[dtype, Storage=src.Storage, address_space=src.address_space, linear_idx_type=src.linear_idx_type], scalar_offset: Int = Int(0), worker_base: Int = Int(0))

Load a warp sub-tile from DRAM to LDS.

The src tile should be a warp-sized sub-tile of the original DRAM tile. Offsets are computed relative to the bc's base pointer, so the src pointer must be within the original tile's address range.

Comptime hoist_scalar_offset selects which scalar-offset codegen path the inner loop takes:

False (default, legacy codegen): scalar_offset is ignored. Each iteration recomputes Int(src_partitions.ptr) - dram_base, where src_partitions is the per-iteration sub-tile of src. This matches the pre-refactor inline DMA emission: an s_add of the per-iter pointer base plus a bc-base subtract. This is what MhaPrefillV2, KVBuffer, and _MlaKDmaPair at KV<128 want: the legacy SGPR pressure profile that benchmarks verified at KV=64 (no -17% regression).
True (opt-in hoist): the caller's scalar_offset is used directly, plus the comptime partition_offset_bytes. The runtime piece is computed once at the call site and shared across the inner loop iterations. This is what _MlaKDmaPair at KV>=128 wants: one SGPR carries the hoisted base across both dma_nope and dma_rope, giving the +7% KV=128 lift.
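The two paths should yield the same byte offsets and differ only in where the subtraction happens. The Python sketch below models that arithmetic on the host. All addresses, the partition stride, and the iteration count are hypothetical values, and it assumes the per-iteration sub-tiles are evenly spaced.

```python
DRAM_BASE = 0x1000           # hypothetical base address held by the descriptor
SRC_PTR = DRAM_BASE + 768    # hypothetical start of the warp sub-tile
PARTITION_BYTES = 256        # hypothetical compile-time stride per iteration
NUM_ITERS = 4


def legacy_offsets() -> list:
    # hoist_scalar_offset=False: subtract the base inside every iteration.
    offsets = []
    for i in range(NUM_ITERS):
        partition_ptr = SRC_PTR + i * PARTITION_BYTES
        offsets.append(partition_ptr - DRAM_BASE)
    return offsets


def hoisted_offsets(scalar_offset: int) -> list:
    # hoist_scalar_offset=True: the caller supplies the runtime piece once;
    # each iteration only adds a constant partition offset.
    return [scalar_offset + i * PARTITION_BYTES for i in range(NUM_ITERS)]


scalar_offset = SRC_PTR - DRAM_BASE   # computed once at the call site
legacy = legacy_offsets()
hoisted = hoisted_offsets(scalar_offset)

print(legacy, hoisted)
```

This shows only the address arithmetic. The register-pressure and throughput effects described in the list are codegen properties that a host-side model cannot reproduce.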

Parameters:

hoist_scalar_offset (Bool): Comptime flag selecting between legacy per-iter codegen (False, default) and the explicit caller-supplied hoisted base (True).

Args:

dst (TileTensor[dtype, Storage=dst.Storage, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type]): Destination TileTensor in shared memory.
src (TileTensor[dtype, Storage=src.Storage, address_space=src.address_space, linear_idx_type=src.linear_idx_type]): Source TileTensor in global memory (warp sub-tile).
scalar_offset (Int): Wave-uniform byte offset of src relative to the buffer-resource base. Only consumed when hoist_scalar_offset is True; pass 0 (or any value — it is dead-code-eliminated) when False.
worker_base (Int): Sub-tile row-strip index for cooperative half-sub-block loads (N-warps-per-subblock partition at depths < 128). When a caller splits a BM-row sub-block across N warps and passes each warp its own M = BM/N-row strip, the loader's internal m_sub_tile collapses to {0} and the swizzle would be computed as if the strip were the FIRST sub-row — dropping the m_sub_tile * WARP_SIZE worker offset that the two-XOR st_32x32_s swizzle needs (the Swizzle(1,0,6) bit-0 flip keys off worker bit 6). Pass the strip's absolute sub-row index here so the swizzle matches the consumer read (MhaMmaOp.load_K). Default 0 = full sub-block load (depth 128), unchanged.
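The worker_base issue can be sketched in Python under two stated assumptions: a 64-lane warp, and a worker index formed as (worker_base + m_sub_tile) * WARP_SIZE + lane. Both are illustrative guesses at the loader's internals, and worker_index and swizzle_1_0_6 are hypothetical helpers.

```python
WARP_SIZE = 64  # assumption: 64-lane wavefront


def swizzle_1_0_6(worker: int) -> int:
    # Model of Swizzle(1, 0, 6): bit 0 is flipped when bit 6 is set.
    return worker ^ ((worker >> 6) & 1)


def worker_index(lane: int, m_sub_tile: int, worker_base: int = 0) -> int:
    # Assumed combination of strip index, internal sub-tile index, and lane.
    return (worker_base + m_sub_tile) * WARP_SIZE + lane


lanes = range(4)

# Consumer view: the second sub-row (index 1) of a full sub-block.
expected = [swizzle_1_0_6(worker_index(l, m_sub_tile=1)) for l in lanes]

# Producer loading that strip alone: m_sub_tile collapses to 0.
without_base = [swizzle_1_0_6(worker_index(l, 0)) for l in lanes]
with_base = [swizzle_1_0_6(worker_index(l, 0, worker_base=1)) for l in lanes]

print(expected, without_base, with_base)
```

Without worker_base, bit 6 of the modeled worker index is never set, so the bit-0 flip is skipped and the producer layout no longer matches what the consumer expects. Passing the strip's absolute sub-row index restores the match.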
