
Mojo trait

TileLoader

DRAM→LDS DMA loader contract for tile_rows × tile_cols half-tiles.

Implementations cooperate as a warp group to fill a shared-memory (SMEM, i.e. LDS) half-tile via buffer_load_*_lds. The kernel walks coordinates in (m_offset, k_offset) GEMM space; the loader translates them to physical addresses internally. Conformers must be TrivialRegisterPassable so that the kernel can pass them by value through closures.

Two conformers ship today:

TileLoaderLDS — linear 2D source. Used by matmul A/B operands and by conv's B (filter) operand. The address math is addr = (m_offset * stride) + k_offset.
TileLoaderLDSIm2col — NHWC + in-line im2col. Used by conv's A (input) operand. The address math decomposes m_offset → (n, h_out, w_out) and k_offset → (kh, kw, c) at load time; conv geometry (R, S, H, W, stride, dilation, pad) is loader-internal state.
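The two address schemes above can be modeled on the host in plain Python. This is a minimal sketch, not the Mojo implementation: `ConvGeometry`, `linear_addr`, and `im2col_addr` are hypothetical names, offsets are in elements rather than bytes, and the decomposition order (w_out fastest within m_offset, c fastest within k_offset) is an assumption based on the tuple order and NHWC layout named in the text.

```python
from dataclasses import dataclass
from typing import Optional


def linear_addr(m_offset: int, k_offset: int, stride: int) -> int:
    """Element offset in a linear 2D source: addr = (m_offset * stride) + k_offset."""
    return m_offset * stride + k_offset


@dataclass(frozen=True)
class ConvGeometry:
    """Hypothetical stand-in for the loader-internal conv state."""
    H: int            # input height
    W: int            # input width
    C: int            # input channels
    R: int            # filter height (bounds kh; not needed for the address itself)
    S: int            # filter width
    H_out: int        # output height
    W_out: int        # output width
    stride: int = 1
    dilation: int = 1
    pad: int = 0


def im2col_addr(m_offset: int, k_offset: int, g: ConvGeometry) -> Optional[int]:
    """NHWC element offset for one GEMM coordinate, or None if it falls in padding."""
    # m_offset -> (n, h_out, w_out), assuming w_out varies fastest.
    n, rem = divmod(m_offset, g.H_out * g.W_out)
    h_out, w_out = divmod(rem, g.W_out)
    # k_offset -> (kh, kw, c), assuming c varies fastest.
    kh, rem = divmod(k_offset, g.S * g.C)
    kw, c = divmod(rem, g.C)
    # Map the output pixel and filter tap back to an input pixel.
    h = h_out * g.stride - g.pad + kh * g.dilation
    w = w_out * g.stride - g.pad + kw * g.dilation
    if not (0 <= h < g.H and 0 <= w < g.W):
        return None  # padded region: no backing element in the input
    return ((n * g.H + h) * g.W + w) * g.C + c
```

For a 4 × 4 × 2 input with a 3 × 3 filter and pad 1, `im2col_addr(5, 9, g)` resolves to output pixel (1, 1), filter tap (1, 1), channel 1, which is input element 11. The real loader presumably handles the padded case in hardware-specific ways that this model does not capture.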

The kernel doesn't have to know which loader is in use — it just advances (m_offset, k_offset) through the K-loop. That's the point of the trait: the conv body and matmul body can share everything except which loader they instantiate.
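The sharing described here can be illustrated with a small Python model. It is only an analogy for the Mojo trait: `TileLoaderModel`, `LinearLoader`, `ColumnMajorLoader`, and `k_loop` are hypothetical names, and `ColumnMajorLoader` merely stands in for a second conformer with different address math (it is not TileLoaderLDSIm2col).

```python
from typing import List, Protocol


class TileLoaderModel(Protocol):
    """Python stand-in for the trait: one method taking GEMM-space offsets."""
    def load_tile(self, dst: List[int], m_offset: int, k_offset: int) -> None: ...


class LinearLoader:
    """Row-major source: addr = (m * stride) + k."""
    def __init__(self, src, stride, tile_rows, tile_cols):
        self.src, self.stride = src, stride
        self.tile_rows, self.tile_cols = tile_rows, tile_cols

    def load_tile(self, dst, m_offset, k_offset):
        for r in range(self.tile_rows):
            for c in range(self.tile_cols):
                addr = (m_offset + r) * self.stride + (k_offset + c)
                dst[r * self.tile_cols + c] = self.src[addr]


class ColumnMajorLoader:
    """Hypothetical second conformer: same interface, different address math."""
    def __init__(self, src, stride, tile_rows, tile_cols):
        self.src, self.stride = src, stride
        self.tile_rows, self.tile_cols = tile_rows, tile_cols

    def load_tile(self, dst, m_offset, k_offset):
        for r in range(self.tile_rows):
            for c in range(self.tile_cols):
                addr = (k_offset + c) * self.stride + (m_offset + r)
                dst[r * self.tile_cols + c] = self.src[addr]


def k_loop(loader: TileLoaderModel, m_offset: int, K: int,
           tile_rows: int, tile_cols: int) -> List[List[int]]:
    """Shared body: advances k_offset and never inspects the source layout."""
    tiles = []
    for k_offset in range(0, K, tile_cols):
        dst = [0] * (tile_rows * tile_cols)
        loader.load_tile(dst, m_offset, k_offset)
        tiles.append(dst)
    return tiles
```

Feeding the same logical 4 × 8 matrix to both loaders, once stored row-major and once column-major, yields identical tiles from the same `k_loop` body. That is the property the trait is meant to give the conv and matmul kernels.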

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

`comptime` members

`dtype`

comptime dtype

`tile_cols`

comptime tile_cols

`tile_rows`

comptime tile_rows

Required methods

`__init__`

def __init__(out self, *, copy: Self)

Create a new instance of the value by copying an existing one.

Args:

copy (_Self): The value to copy.

Returns:

_Self

def __init__(out self, *, deinit move: Self)

Create a new instance of the value by moving the value of another.

Args:

move (_Self): The value to move.

Returns:

_Self

`load_tile`

def load_tile(self, dst: TileTensor[Self.dtype, address_space=AddressSpace.SHARED], m_offset: Int, k_offset: Int)

Loads a half-tile from global memory into the SMEM dst.

Issues num_iterations buffer_load_*_lds bursts (per lane) that together fill the tile_rows × tile_cols SMEM half-tile. Each iteration costs one vmcnt-tracked outstanding load per lane — the 4-wave software pipeline relies on this exact accounting.

Args:

dst (TileTensor[_Self.dtype, address_space=AddressSpace.SHARED]): Destination half-tile in SHARED address space, sized tile_rows × tile_cols.
m_offset (Int): Row offset (M dimension) of the sub-tile origin in GEMM space.
k_offset (Int): Column offset (K dimension) of the sub-tile origin in GEMM space.
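The per-lane load accounting can be sketched as simple arithmetic. This is a rough model under stated assumptions, not the library's formula: the function name and all of its parameters are hypothetical, and the page specifies neither how many lanes cooperate nor how many bytes each buffer_load_*_lds moves per lane.

```python
def num_iterations(tile_rows: int, tile_cols: int, dtype_bytes: int,
                   num_lanes: int, bytes_per_lane_per_load: int) -> int:
    """Loads each lane must issue so that all lanes together fill one half-tile.

    Assumes every lane moves the same number of bytes per load and that the
    tile size divides evenly; both are assumptions, not documented behavior.
    """
    tile_bytes = tile_rows * tile_cols * dtype_bytes
    bytes_per_iteration = num_lanes * bytes_per_lane_per_load
    if tile_bytes % bytes_per_iteration != 0:
        raise ValueError("tile size is not a multiple of one cooperative load")
    return tile_bytes // bytes_per_iteration


def outstanding_loads(tiles_in_flight: int, iterations_per_tile: int) -> int:
    """Per-lane vmcnt-tracked loads pending after issuing several load_tile calls."""
    return tiles_in_flight * iterations_per_tile
```

As an example under these assumptions, a 128 × 64 half-tile of 2-byte elements filled by 256 lanes moving 16 bytes each takes `num_iterations(128, 64, 2, 256, 16) == 4` loads per lane. Two such tiles in flight leave 8 outstanding loads per lane, which is the kind of count a software pipeline would need when choosing its wait thresholds.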

Provided methods

`copy`

def copy(self) -> Self

Explicitly constructs a copy of self. This is a convenience method for Self(copy=self) when the type is inconvenient to write out.

Overriding this method is not allowed.

Returns:

_Self: A copy of this value.
