For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Mojo module

amd_tile_io_conv

DRAM->LDS DMA loader for AMD implicit-GEMM convolution.

TileLoaderLDSIm2col is a sibling expert of structured_kernels.amd_tile_io.TileLoaderLDS under the same DRAM→LDS warp-cooperative direction. The split parallels the existing TileLoaderLDS (linear 2D source, coord-indexed) vs SubTileLoaderLDS (single sub-tile, TileTensor-indexed, async-copies alias scope) — same direction, different cooperation or source shape. Here the differentiator is the source layout type: this loader accepts a 4D NHWC input tensor and (m_offset, k_offset) coordinates in GEMM space (where M = N*H_out*W_out and K = R*S*C), translating them to NHWC linear offsets at load time. The per-lane vs uniform split of each address is preserved so each iteration still issues a single buffer_load_*_lds per lane, matching the vmcnt accounting amd_4wave_matmul's schedule relies on.

For each lane each iteration, load_tile computes

(n, h_out, w_out) = decompose(m_offset + lane_offset, H_out, W_out)
(kh, kw, c)       = decompose(k_offset + lane_offset, R, S, C)
h_in              = h_out * stride_h + kh * dilation_h - pad_h
w_in              = w_out * stride_w + kw * dilation_w - pad_w
addr              = ((n*H + h_in)*W + w_in)*C + c

and emits one buffer_load_*_lds per lane targeting addr. Three comptime sub-paths inside the body, picked at struct instantiation:

Pure-pointwise fast path (R=S=1, stride=1, dilation=1, pad=0): math collapses to m*C + k, identical to TileLoaderLDS(stride = C). Same instruction stream as the matmul loader; same VGPR usage.
Uniform-substrip path (general R×S, tile_cols ≤ C and C % tile_cols == 0): each load_tile call's lanes all sit in one (kh, kw) substrip; (kh, kw, c_base) are computed once per call from k_offset. Address-decomp adds the per-lane (n, h_out, w_out) divmods.
Per-lane substrip path (BK > C or non-aligned, e.g. ResNet stem with C_in = 64 and BK = 128): each lane independently decomposes its k_lane = k_offset + thread_col. Adds two more divmods per lane but unblocks shapes the fast path can't cover.

Halo (pad > 0) is handled by routing OOB lanes (h_in or w_in outside [0, H)/[0, W)) to the SRD's num_records sentinel — the AMD buffer-resource hardware bound check then clamps the read to 0. Gated by a _needs_halo_mask comptime flag so pad=0 callers keep their exact instruction stream.

The SRD covers the full NHWC tensor (N*H*W*C elements), not a per-block slice — so OOB reads (deliberate halo or otherwise) are bounded by the actual allocation, in contrast to TileLoaderLDS which sizes the SRD to the per-block .tile[BM, K]() view. The full-tensor pattern caps at 4 GiB (AMDBufferResource.num_records is a 32-bit byte count); fine for typical NHWC conv inputs but watch the cap when NHW*C grows.

K-padding is the caller's responsibility (the 4-wave schedule requires K_filter % (2*BK) == 0). Set C to the real input channel count and pre-pad the filter's trailing K columns with zeros; the address math naturally handles the padded K reads via SRD-OOB or filter-zero multiplies in the MMA.

Structs

TileLoaderLDSIm2col: DRAM->LDS DMA expert for implicit-GEMM convolution, NHWC inputs.

Structs​

Structs