Mojo struct
SubTileLoaderLDS
struct SubTileLoaderLDS[dtype: DType, swizzle: Optional[Swizzle] = Optional(), swizzle2: Optional[Swizzle] = Optional()]
DRAM→LDS DMA expert for single-sub-tile TileTensor-indexed loads.
Sibling of TileLoaderLDS (warp-group cooperative coord-indexed).
This one issues one buffer_load_*_lds burst per .load() call for
a single source sub-tile. Attention's KV-cache warp DMA pattern:
each warp claims a (warp_tile_rows, BK) slice of a
(BN, K-span) DRAM tile and streams it into its SMEM lane.
The AMD buffer_load_*_lds intrinsic is emitted with the
amdgpu.AsyncCopies alias scope via rocdl.raw.ptr.buffer.load.lds
so consumer-side LDS reads tagged with
noalias_scopes=_alias_scope_attr (see ds_read_tr* at lines 96,
419-480) can skip s_waitcnt vmcnt(0) (LLVM PR #74537's
SIInsertWaitcnts vmcnt-relaxation handshake). This is safe because
attention kernels also maintain an explicit
s_waitcnt vmcnt(0) + s_barrier fence at DMA/compute boundaries.
Why not stdlib load_to_lds[async_copies=True]: stdlib's
async_copies=True attaches its OWN alias_scope MLIR attribute
which is textually identical to _alias_scope_attr but an
MLIR-distinct object; ScopedNoAliasAA matches by identity, so
the DMA and LDS-consumer scopes don't match and the relaxation is
silently disabled (MLA regresses 0.76 abs at output[0,0,0,0], the
same signature as b7b68a00290). The intrinsic emission is kept
local to this file so that producer and consumer share the exact same
_alias_scope_attr object. If stdlib ever exports the scope as a
shareable symbol, collapse this body to bc.load_to_lds[async_copies=True].
Constructs the AMDBufferResource once from a DRAM tile (which may
carry Scalar valid_rows for bounds clamping via MixedLayout).
Each load() call reuses the descriptor: one shared bc per tile,
zero per-warp overhead for buffer resource construction. SRD bounds
are computed by make_amd_buffer_resource via _get_bounds; hardware
clamps OOB reads to zero.
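The descriptor-reuse and bounds-clamping behaviour described above can be pictured with a small host-side Python model. This is only a sketch under stated assumptions: `BufferResourceModel`, `make_resource`, and the byte-per-element layout are hypothetical names and simplifications invented for illustration, not the real AMDBufferResource or make_amd_buffer_resource API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BufferResourceModel:
    """Toy stand-in for a buffer descriptor: backing storage plus a byte bound."""
    memory: List[int]   # one int per byte-sized element, for simplicity
    num_records: int    # valid size in bytes, derived from the valid row count

    def load(self, byte_offset: int) -> int:
        # Model of hardware range checking: out-of-bounds reads return zero.
        if 0 <= byte_offset < self.num_records:
            return self.memory[byte_offset]
        return 0


def make_resource(memory, valid_rows, row_bytes):
    # Built once per tile; the bound covers only the valid rows.
    return BufferResourceModel(list(memory), valid_rows * row_bytes)


# A 4-row tile of 8 bytes per row in which only 3 rows are valid.
bc = make_resource(range(1, 33), valid_rows=3, row_bytes=8)
in_bounds = [bc.load(2 * 8 + c) for c in range(8)]  # row 2: inside the bound
clamped = [bc.load(3 * 8 + c) for c in range(8)]    # row 3: past the bound
```

In this model every load goes through the same `bc` object, mirroring the single shared descriptor, and the partially valid last row needs no explicit branch in the caller because the bound check zero-fills it.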
Parameters
- dtype (DType): Element data type.
- swizzle (Optional[Swizzle]): Optional swizzle for bank conflict reduction.
- swizzle2 (Optional[Swizzle]): Optional second swizzle, applied AFTER swizzle. Use to compose two-XOR swizzles (e.g., the reference st_32x32 bit5^=bit9 + bit4^=bit10 byte-level pair, which are not expressible as a single Swizzle).
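To illustrate what composing two XOR swizzles means for the bit5^=bit9 + bit4^=bit10 pair mentioned for swizzle2, here is a minimal Python sketch. The helper names `xor_bit` and `two_xor_swizzle` are hypothetical and model only the bit arithmetic on a byte offset; they are not the Swizzle type itself.

```python
def xor_bit(offset: int, dst_bit: int, src_bit: int) -> int:
    """Return offset with dst_bit flipped when src_bit is set (dst ^= src)."""
    return offset ^ (((offset >> src_bit) & 1) << dst_bit)


def two_xor_swizzle(byte_offset: int) -> int:
    # First swizzle: bit5 ^= bit9. Second, applied after the first: bit4 ^= bit10.
    first = xor_bit(byte_offset, dst_bit=5, src_bit=9)
    return xor_bit(first, dst_bit=4, src_bit=10)


offsets = [0, 1 << 9, 1 << 10, (1 << 9) | (1 << 10)]
swizzled = [two_xor_swizzle(o) for o in offsets]
```

The two XORs use different source-to-destination distances (4 bits and 6 bits), which is consistent with the note that the pair cannot be written as a single swizzle. Because the source bits 9 and 10 are never modified, the composed mapping is its own inverse and a permutation of the offset range.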
Fields
- bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM access.
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDeletable,
Movable,
RegisterPassable,
TrivialRegisterPassable
Methods
__init__
def __init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]) -> Self
Create a loader from a DRAM tile.
The tile's layout carries the valid row count (via Scalar dim[0] in MixedLayout). make_amd_buffer_resource reads that dimension to compute the SRD size.
Args:
- gmem_tile (TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]): The full DRAM tile from KVCacheIterator.
load
def load[hoist_scalar_offset: Bool = False](self, dst: TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size], scalar_offset: Int = 0, worker_base: Int = 0)
Load a warp sub-tile from DRAM to LDS.
The src tile should be a warp-sized sub-tile of the original DRAM tile. Offsets are computed relative to the bc's base pointer, so the src pointer must be within the original tile's address range.
The comptime parameter hoist_scalar_offset selects which scalar-offset
codegen path the inner loop takes:
- False (default, legacy codegen): scalar_offset is ignored. Each iteration recomputes Int(src_partitions.ptr) - dram_base, where src_partitions is the per-iteration sub-tile of src. Matches the pre-refactor inline DMA emission: s_add of the per-iter pointer base + bc-base subtract. This is what MhaPrefillV2, KVBuffer, and _MlaKDmaPair at KV<128 want: the legacy SGPR pressure profile that benches verified at KV=64 (no -17% regression).
- True (opt-in hoist): the caller's scalar_offset is used directly + comptime partition_offset_bytes. The runtime piece is computed ONCE at the call site and shared across the inner loop iterations. This is what _MlaKDmaPair at KV>=128 wants: one SGPR carries the hoisted base across both dma_nope + dma_rope, giving the +7% KV=128 lift.
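The two hoist_scalar_offset paths differ in where the base offset is computed, not in which offsets they produce. The following Python sketch models that arithmetic on the host. It assumes, purely for illustration, that the per-iteration partitions are contiguous and uniformly spaced; `offsets_legacy`, `offsets_hoisted`, and `partition_bytes` are hypothetical names, and the real partition_offset_bytes depends on the tile layout.

```python
def offsets_legacy(src_ptr, dram_base, partition_bytes, num_partitions):
    # hoist_scalar_offset=False: recompute (partition pointer - base) each iteration.
    out = []
    for i in range(num_partitions):
        partition_ptr = src_ptr + i * partition_bytes
        out.append(partition_ptr - dram_base)
    return out


def offsets_hoisted(scalar_offset, partition_bytes, num_partitions):
    # hoist_scalar_offset=True: caller-supplied base plus a constant per-iteration term.
    return [scalar_offset + i * partition_bytes for i in range(num_partitions)]


dram_base = 0x1000
src_ptr = dram_base + 3 * 256        # src starts 768 bytes into the tile
scalar_offset = src_ptr - dram_base  # computed once by the caller
legacy = offsets_legacy(src_ptr, dram_base, partition_bytes=64, num_partitions=4)
hoisted = offsets_hoisted(scalar_offset, partition_bytes=64, num_partitions=4)
```

Both lists are identical in this model, so the choice between the paths is about register pressure and instruction count on the device, which is why the flag is selected per kernel and KV depth rather than globally.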
Parameters:
- hoist_scalar_offset (Bool): Comptime flag selecting between legacy per-iter codegen (False, default) and the explicit caller-supplied hoisted base (True).
Args:
- dst (TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size]): Destination TileTensor in shared memory.
- src (TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size]): Source TileTensor in global memory (warp sub-tile).
- scalar_offset (Int): Wave-uniform byte offset of src relative to the buffer-resource base. Only consumed when hoist_scalar_offset is True; pass 0 (or any value; it is dead-code-eliminated) when False.
- worker_base (Int): Sub-tile row-strip index for cooperative half-sub-block loads (N-warps-per-subblock partition at depths < 128). When a caller splits a BM-row sub-block across N warps and passes each warp its own M = BM/N-row strip, the loader's internal m_sub_tile collapses to {0} and the swizzle would be computed as if the strip were the FIRST sub-row, dropping the m_sub_tile * WARP_SIZE worker offset that the two-XOR st_32x32_s swizzle needs (the Swizzle(1,0,6) bit-0 flip keys off worker bit 6). Pass the strip's absolute sub-row index here so the swizzle matches the consumer read (MhaMmaOp.load_K). Default 0 = full sub-block load (depth 128), unchanged.
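The effect of worker_base on the swizzle can be sketched with a small Python model of the bit-0 flip keyed off worker bit 6. Several things here are assumptions made for illustration: a 64-lane wavefront, a worker index formed as sub-row * WARP_SIZE + lane, and worker_base combining with m_sub_tile by simple addition. The names `swizzle_1_0_6` and `worker_index` are hypothetical.

```python
WARP_SIZE = 64  # assumed wavefront width


def swizzle_1_0_6(worker: int) -> int:
    """Flip bit 0 of the index when bit 6 is set."""
    return worker ^ ((worker >> 6) & 1)


def worker_index(lane: int, m_sub_tile: int, worker_base: int = 0) -> int:
    # Absolute worker index: the sub-row offset plus the lane within the warp.
    return (worker_base + m_sub_tile) * WARP_SIZE + lane


lane = 2
# Full sub-block load: the loader iterates m_sub_tile itself; sub-row 1 sets bit 6.
full = swizzle_1_0_6(worker_index(lane, m_sub_tile=1))
# Strip load of the same sub-row without worker_base: m_sub_tile collapses to 0.
strip_missing = swizzle_1_0_6(worker_index(lane, m_sub_tile=0))
# Strip load with worker_base=1 restores the absolute sub-row index.
strip_fixed = swizzle_1_0_6(worker_index(lane, m_sub_tile=0, worker_base=1))
```

In this model the strip load without worker_base leaves bit 0 unflipped, so the producer would write to a different swizzled position than the one the consumer reads, while passing the absolute sub-row index reproduces the full-load result.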