Mojo struct

TileLoaderLDS

struct TileLoaderLDS[dtype_: DType, tile_rows_: Int, tile_cols_: Int, stride: Int, num_loading_warps: Int, swizzle: Optional[Swizzle] = Optional(), load_width: Int = simd_width_of[dtype_](), use_full_tile_width: Bool = False]

DRAM→LDS DMA expert for warp-group cooperative coord-indexed loads.

Sibling of SubTileLoaderLDS, which loads a single sub-tile indexed by TileTensor. TileLoaderLDS instead coordinates a warp group (typically 8 warps) to cooperatively fill a half-tile via coord-indexed iteration: load_tile(dst, m_offset, k_offset) runs num_iterations iterations, each covering up to rows_per_iteration BK-wide rows, and optionally applies a per-iteration byte-space swizzle to avoid LDS bank conflicts. This is the DRAM→LDS pattern used by matmul kernels (ping-pong, etc.).

Uses the stdlib AMDBufferResource.load_to_lds directly, with no alias scope attached. Matmul's scheduling relies on s_sched_group_barrier hints, which do not qualify as the runtime fence required by the SIInsertWaitcnts vmcnt-relaxation contract, so attaching the scope would miscompile (see the async_copies docstring on load_to_lds). For attention patterns that do satisfy the contract via explicit s_waitcnt vmcnt(0) + s_barrier fences, use SubTileLoaderLDS instead.
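The iteration count described above can be modeled with plain integer arithmetic. The following is a minimal Python sketch, not Mojo kernel code: it reuses the rows_per_iteration and num_iterations formulas listed under the comptime members, assumes a warp size of 64 (the struct itself queries _resolve_warp_size()), and assumes each iteration covers a contiguous block of rows. The helper name iteration_row_ranges and the example tile shape are hypothetical.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division, mirroring what ceildiv computes."""
    return -(-a // b)


def iteration_row_ranges(tile_rows, tile_cols, num_loading_warps,
                         load_width, warp_size=64):
    """Return (start_row, end_row) per iteration of a modeled load_tile.

    warp_size=64 is an assumption; contiguous row blocks per iteration
    are also an assumption made for illustration.
    """
    loading_threads = num_loading_warps * warp_size
    # Lanes needed to cover one tile_cols-wide row, load_width elements each.
    lanes_per_row = tile_cols // load_width
    rows_per_iteration = loading_threads // lanes_per_row
    num_iterations = ceildiv(tile_rows, rows_per_iteration)

    ranges = []
    for it in range(num_iterations):
        start = it * rows_per_iteration
        # The last iteration may be partial when tile_rows is not a multiple
        # of rows_per_iteration; clamping here is part of the model only.
        end = min(start + rows_per_iteration, tile_rows)
        ranges.append((start, end))
    return ranges


# Hypothetical configuration: 128x64 half-tile, 8 warps, 8 elements per load.
ranges = iteration_row_ranges(tile_rows=128, tile_cols=64,
                              num_loading_warps=8, load_width=8)
print(ranges)
```

With this configuration the 512 loading threads cover 64 rows per iteration, so the half-tile is filled in two iterations. How the real loader assigns individual rows to warps and lanes is not specified on this page.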

Parameters

dtype_ (DType): Element data type. Re-bound to dtype at body scope to match the TileLoader trait alias.
tile_rows_ (Int): Height of each half-tile to load. Re-bound to tile_rows at body scope.
tile_cols_ (Int): Width (K dimension) of each half-tile. Re-bound to tile_cols at body scope.
stride (Int): Row stride of the source GMEM tensor.
num_loading_warps (Int): Warps cooperating on each load (typically 8).
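swizzle (Optional[Swizzle]): Optional byte-space swizzle applied to avoid LDS bank conflicts. Defaults to no swizzle.
load_width (Int): Elements per load (SIMD width). Defaults to simd_width_of[dtype_]().
use_full_tile_width (Bool): FP8 row-major mode. When True, subtile_cols equals tile_cols; otherwise it is 32. Defaults to False.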
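To illustrate what a byte-space swizzle can do for bank conflicts, here is a small Python sketch of a generic CuTe-style XOR swizzle. Whether Swizzle applies exactly this form in this loader, and which bits/base/shift values matmul kernels pass, is an assumption; the function name xor_swizzle and the parameter values below are hypothetical.

```python
def xor_swizzle(offset: int, bits: int, base: int, shift: int) -> int:
    """Generic XOR swizzle on a byte offset (assumed form, shift > 0).

    Takes `bits` bits starting at position base + shift and XORs them
    into the `bits` bits starting at position base.
    """
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


# Hypothetical layout: 128-byte rows split into eight 16-byte chunks.
# bits=3 selects 8 chunks, base=4 gives 16-byte granularity, and shift=3
# pulls the low 3 bits of the row index into the chunk index.
ROW_BYTES = 128
CHUNK_BYTES = 16

# Chunk 0 of rows 0..7: unswizzled, every row lands in the same chunk column.
chunk_columns = [
    (xor_swizzle(row * ROW_BYTES, bits=3, base=4, shift=3) % ROW_BYTES)
    // CHUNK_BYTES
    for row in range(8)
]
print(chunk_columns)
```

In this model the same logical chunk of eight consecutive rows is spread over eight different chunk columns, which is the usual way such a swizzle avoids lanes hitting the same banks. The mapping is also a permutation of the byte offsets, so no two source locations collide.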

Fields

buffer (AMDBufferResource):
thread_row (Int):
thread_col (Int):
warp_id (Int):
lane_id (Int):
m_anchor (Int):
k_anchor (Int):

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TileLoader, TrivialRegisterPassable

`comptime` members

`dtype`

comptime dtype = dtype_

`elements_per_warp`

comptime elements_per_warp = (_resolve_warp_size() * load_width)

`lane_load_bytes`

comptime lane_load_bytes = (load_width * size_of[dtype_]())

`loading_threads`

comptime loading_threads = (num_loading_warps * _resolve_warp_size())

`num_iterations`

comptime num_iterations = ceildiv(tile_rows_, ((_resolve_warp_size() * num_loading_warps) // (tile_cols_ // load_width)))

`num_warp_cols`

comptime num_warp_cols = (tile_cols_ // (tile_cols_ if use_full_tile_width else Int(32)))

`num_warp_rows`

comptime num_warp_rows = (num_loading_warps // (tile_cols_ // (tile_cols_ if use_full_tile_width else Int(32))))

`row_bytes`

comptime row_bytes = (tile_cols_ * size_of[dtype_]())

`rows_per_iteration`

comptime rows_per_iteration = ((_resolve_warp_size() * num_loading_warps) // (tile_cols_ // load_width))

`rows_per_warp`

comptime rows_per_warp = ((_resolve_warp_size() * load_width) // tile_cols_)

`subtile_cols`

comptime subtile_cols = Self.tile_cols if use_full_tile_width else Int(32)

`thread_rows`

comptime thread_rows = (_resolve_warp_size() // ((tile_cols_ if use_full_tile_width else Int(32)) // load_width))

`threads_per_row`

comptime threads_per_row = ((tile_cols_ if use_full_tile_width else Int(32)) // load_width)

`tile_cols`

comptime tile_cols = tile_cols_

`tile_rows`

comptime tile_rows = tile_rows_

`total_warp_rows`

comptime total_warp_rows = ceildiv(tile_rows_, ((_resolve_warp_size() * load_width) // tile_cols_))

`warp_subtile_bytes`

comptime warp_subtile_bytes = ((((_resolve_warp_size() * load_width) // tile_cols_) * tile_cols_) * size_of[dtype_]())
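As a worked example of how these members relate, the Python sketch below evaluates them for one hypothetical configuration: a 2-byte element type, a 128x64 half-tile, 8 loading warps, load_width = 8, and a warp size of 64. All of those values are assumptions chosen for illustration. The sketch also assumes that the conditional in subtile_cols is evaluated before the surrounding floor division in num_warp_cols, threads_per_row, and thread_rows.

```python
def ceildiv(a: int, b: int) -> int:
    return -(-a // b)


def derived_constants(dtype_bytes, tile_rows, tile_cols, num_loading_warps,
                      load_width, use_full_tile_width=False, warp_size=64):
    """Model of the comptime members; warp_size=64 is an assumption."""
    elements_per_warp = warp_size * load_width
    rows_per_warp = elements_per_warp // tile_cols
    rows_per_iteration = (warp_size * num_loading_warps) // (tile_cols // load_width)
    # Assumed grouping: the conditional is a single sub-expression.
    subtile_cols = tile_cols if use_full_tile_width else 32
    threads_per_row = subtile_cols // load_width
    num_warp_cols = tile_cols // subtile_cols
    return {
        "elements_per_warp": elements_per_warp,
        "lane_load_bytes": load_width * dtype_bytes,
        "loading_threads": num_loading_warps * warp_size,
        "rows_per_iteration": rows_per_iteration,
        "num_iterations": ceildiv(tile_rows, rows_per_iteration),
        "row_bytes": tile_cols * dtype_bytes,
        "rows_per_warp": rows_per_warp,
        "total_warp_rows": ceildiv(tile_rows, rows_per_warp),
        "warp_subtile_bytes": rows_per_warp * tile_cols * dtype_bytes,
        "subtile_cols": subtile_cols,
        "threads_per_row": threads_per_row,
        "thread_rows": warp_size // threads_per_row,
        "num_warp_cols": num_warp_cols,
        "num_warp_rows": num_loading_warps // num_warp_cols,
    }


consts = derived_constants(dtype_bytes=2, tile_rows=128, tile_cols=64,
                           num_loading_warps=8, load_width=8)
for name, value in consts.items():
    print(f"{name} = {value}")
```

In this configuration each lane moves 16 bytes per load, one warp covers 8 full rows (1024 bytes), and the 8-warp group covers 64 rows per iteration, so two iterations fill the half-tile.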

Methods

`__init__`

def __init__(src: TileTensor[Self.dtype], warp_id: Int, lane_id: Int, *, m_anchor: Int = Int(0), k_anchor: Int = Int(0)) -> Self

Builds the loader.

Args:

src (TileTensor[Self.dtype]): GMEM tile to source from. Pass the full A/B tensor and set m_anchor/k_anchor to the per-block origin, or pass a pre-sliced block tile with zero anchors (legacy behavior). The full-tensor form lets the SRD's num_records bound the actual allocation rather than the block view — required for split-K kernels and for parity with TileLoaderLDSIm2col.
warp_id (Int): Warp identifier within the loading warp group.
lane_id (Int): Lane identifier within the warp.
m_anchor (Int): M-coordinate (row dim) of the block origin in the loader's SRD coordinate system. Added to m_offset at load time. Defaults to 0.
k_anchor (Int): K-coordinate (column dim) of the block origin. Added to k_offset at load time. Defaults to 0.

`load_tile`

def load_tile(self, dst: TileTensor[Self.dtype, address_space=AddressSpace.SHARED], m_offset: Int, k_offset: Int)

Loads a half-tile from GMEM into SMEM dst via load_to_lds.

The effective GEMM-space coordinate is (m_anchor + m_offset, k_anchor + k_offset), so callers using the legacy pre-sliced-block form (anchors=0) keep their address math unchanged.

Args:

dst (TileTensor[Self.dtype, address_space=AddressSpace.SHARED]): Destination TileTensor in SHARED (half-tile sized).
m_offset (Int): Row offset (M dim) within the block.
k_offset (Int): Column (K dim) offset within the block.
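The anchor arithmetic can be checked with a short Python sketch. It models the effective coordinate (m_anchor + m_offset, k_anchor + k_offset) and compares the full-tensor form against the legacy pre-sliced form. The row-major element offset row * stride + col is an assumption about how stride is applied, and the helper names and numeric values are hypothetical.

```python
def effective_coord(m_anchor, k_anchor, m_offset, k_offset):
    """GEMM-space coordinate used by the modeled load_tile."""
    return (m_anchor + m_offset, k_anchor + k_offset)


def linear_offset(row, col, stride):
    """Row-major element offset (assumed interpretation of stride)."""
    return row * stride + col


stride = 4096                  # hypothetical row stride of the source tensor
m_anchor, k_anchor = 256, 512  # hypothetical block origin
m_offset, k_offset = 64, 32    # offsets within the block

# Full-tensor form: src is the whole tensor, anchors locate the block.
row, col = effective_coord(m_anchor, k_anchor, m_offset, k_offset)
full_form = linear_offset(row, col, stride)

# Legacy form: src is already sliced to the block, anchors are zero.
block_base = linear_offset(m_anchor, k_anchor, stride)
row0, col0 = effective_coord(0, 0, m_offset, k_offset)
legacy_form = block_base + linear_offset(row0, col0, stride)

print(full_form, legacy_form)
```

Both forms resolve to the same element under this model. The difference noted for the src argument is which allocation the buffer descriptor bounds, not which element is addressed.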
