Mojo struct
SubTileLoaderLDS
struct SubTileLoaderLDS[dtype: DType, swizzle: Optional[Swizzle] = Optional(), swizzle2: Optional[Swizzle] = Optional()]
DRAM→LDS DMA expert for single-sub-tile, TileTensor-indexed loads.
Sibling of TileLoaderLDS (warp-group cooperative, coord-indexed). This loader issues one buffer_load_*_lds burst per .load() call for a single source sub-tile. It implements attention's KV-cache warp DMA pattern: each warp claims a (warp_tile_rows, BK) slice of a (BN, K-span) DRAM tile and streams it into its SMEM lane.
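
The slice-claiming pattern can be pictured with a small Python model. This is only a sketch: the function name warp_slice, the k_step argument, and the assumption that warps are stacked along the BN rows are illustrative and are not taken from the kernel source.

```python
def warp_slice(warp_id, k_step, warp_tile_rows, bk, bn, k_span):
    """Return (row_start, row_end, col_start, col_end) of one warp's slice.

    Assumption (hypothetical): warps are stacked along the BN rows, and
    k_step selects the BK-wide column window within the K-span.
    """
    row_start = warp_id * warp_tile_rows
    col_start = k_step * bk
    if row_start + warp_tile_rows > bn or col_start + bk > k_span:
        raise ValueError("slice falls outside the (BN, K-span) tile")
    return row_start, row_start + warp_tile_rows, col_start, col_start + bk


# Example: a 64 x 128 tile split among 4 warps, 16 rows each, BK = 32.
slices = [warp_slice(w, 0, 16, 32, 64, 128) for w in range(4)]
```

Under these assumptions each warp owns a disjoint band of rows, so one .load() per warp covers the full BK-wide window of the tile.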
The AMD buffer_load_*_lds intrinsic is emitted via rocdl.raw.ptr.buffer.load.lds with the amdgpu.AsyncCopies alias scope, so consumer-side LDS reads tagged with noalias_scopes=_alias_scope_attr (see ds_read_tr* at lines 96, 419-480) can skip s_waitcnt vmcnt(0). This is the SIInsertWaitcnts vmcnt-relaxation handshake from LLVM PR #74537. It is safe because attention kernels also maintain an explicit s_waitcnt vmcnt(0) + s_barrier fence at DMA/compute boundaries.
Why not stdlib load_to_lds[async_copies=True]: stdlib's async_copies=True attaches its own alias_scope MLIR attribute, which is textually identical to _alias_scope_attr but is a distinct MLIR object. ScopedNoAliasAA matches by identity, so the DMA and LDS-consumer scopes don't match and the relaxation is silently disabled (MLA regresses 0.76 abs at output[0,0,0,0], the same signature as b7b68a00290). The intrinsic emission is therefore kept local to this file so that producer and consumer share the exact same _alias_scope_attr object. If stdlib ever exports the scope as a shareable symbol, collapse this body to bc.load_to_lds[async_copies=True].
The AMDBufferResource is constructed once from a DRAM tile (which may carry a Scalar valid_rows for bounds clamping via MixedLayout). Each load() call reuses the descriptor: one shared bc per tile, with zero per-warp overhead for buffer resource construction. SRD bounds are computed by make_amd_buffer_resource via _get_bounds; the hardware returns zero for out-of-bounds (OOB) reads.
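
As a rough illustration of the build-once, bounds-checked behaviour described above, here is a toy Python model. BufferResourceModel is a hypothetical stand-in, not the real AMDBufferResource, and it counts bounds in elements for simplicity.

```python
class BufferResourceModel:
    """Toy stand-in for a bounds-checked buffer descriptor (hypothetical)."""

    def __init__(self, data, valid_rows, row_stride):
        self.data = data
        # The bound is derived once from the valid row count and then reused.
        self.num_records = valid_rows * row_stride

    def load(self, offset):
        """Return the element at offset, or zero when offset is out of bounds."""
        if 0 <= offset < self.num_records:
            return self.data[offset]
        return 0


# A 4 x 8 tile of which only the first 3 rows are valid.
tile = list(range(1, 33))
bc = BufferResourceModel(tile, valid_rows=3, row_stride=8)
last_valid = bc.load(23)   # last element of row 2
past_bounds = bc.load(24)  # first element of row 3, outside the bound
```

In this model the descriptor is built a single time and every subsequent load only pays for the bounds comparison, which mirrors the one-shared-bc-per-tile design.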
Parameters
- dtype (DType): Element data type.
- swizzle (Optional[Swizzle]): Optional swizzle for bank conflict reduction.
- swizzle2 (Optional[Swizzle]): Optional second swizzle, applied AFTER swizzle. Use to compose two-XOR swizzles (e.g., HK's st_32x32 bit5 ^= bit9 + bit4 ^= bit10 byte-level pair, which are not expressible as a single Swizzle).
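
The two-XOR composition mentioned for swizzle2 can be sketched on plain byte offsets in Python. The helpers xor_bit and compose are hypothetical, and which XOR is assigned to swizzle versus swizzle2 is an assumption made for illustration; this does not use the Mojo Swizzle type.

```python
def xor_bit(offset, dst_bit, src_bit):
    """Flip bit dst_bit of offset when bit src_bit is set (dst ^= src)."""
    return offset ^ (((offset >> src_bit) & 1) << dst_bit)


def compose(first, second):
    """Apply first, then second, matching 'swizzle2 is applied after swizzle'."""
    return lambda offset: second(first(offset))


# Assumed assignment of the two XORs to the two parameters.
swizzle = lambda off: xor_bit(off, 5, 9)    # bit5 ^= bit9
swizzle2 = lambda off: xor_bit(off, 4, 10)  # bit4 ^= bit10
two_xor = compose(swizzle, swizzle2)

swizzled = two_xor((1 << 9) | (1 << 10))
```

The two XORs span different bit distances (4 and 6), so they are modelled here as two separate steps applied in sequence.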
Fields
- bc (AMDBufferResource): The 128-bit buffer resource descriptor for DRAM access.
Implemented traits
AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable
Methods
__init__
__init__(gmem_tile: TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]) -> Self
Create a loader from a DRAM tile.
The tile's layout carries the valid row count (via Scalar dim[0] in MixedLayout). make_amd_buffer_resource reads that dimension to compute the SRD size.
Args:
- gmem_tile (TileTensor[dtype, address_space=gmem_tile.address_space, linear_idx_type=gmem_tile.linear_idx_type, element_size=gmem_tile.element_size]): The full DRAM tile from KVCacheIterator.
load
load(self, dst: TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size], src: TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size])
Load a warp sub-tile from DRAM to LDS.
The src tile should be a warp-sized sub-tile of the original DRAM tile. Offsets are computed relative to the bc's base pointer, so the src pointer must be within the original tile's address range.
Args:
- dst (TileTensor[dtype, address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=dst.element_size]): Destination TileTensor in shared memory.
- src (TileTensor[dtype, address_space=src.address_space, linear_idx_type=src.linear_idx_type, element_size=src.element_size]): Source TileTensor in global memory (warp sub-tile).
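
The requirement that src lie within the original tile follows from how the offset is formed. A minimal Python sketch, assuming a byte-addressed model; the name subtile_offset and its arguments are hypothetical and not part of the API:

```python
def subtile_offset(base_addr, tile_bytes, src_addr):
    """Byte offset of a sub-tile pointer relative to the descriptor's base.

    Raises ValueError when src_addr is not inside the original tile, since
    the offset would then not describe a location the descriptor covers.
    """
    offset = src_addr - base_addr
    if offset < 0 or offset >= tile_bytes:
        raise ValueError("src pointer is outside the original tile")
    return offset


# A 4 KiB tile based at 0x1000; the sub-tile starts 1 KiB into it.
offset = subtile_offset(0x1000, 4096, 0x1400)
```

A sub-tile taken from a different tile than the one passed to __init__ would fail this check, which is why the loader is constructed per DRAM tile.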